1 Introduction

Topic models, such as Latent Dirichlet Allocation (LDA) and its extensions, provide a powerful method for inferring latent topics in a document corpus (Blei et al. 2003; Gao et al. 2017). Standard unsupervised topic models extract hidden structure from a corpus based on the assumptions that each topic is a multinomial distribution over a given vocabulary and each document is a multinomial distribution over these topics (Li et al. 2018). However, many researchers have found that unsupervised models often produce incoherent topics, which constitutes the primary obstacle to the acceptance of these models in applications (Chang et al. 2009; Mimno et al. 2011).

The key reason that the above standard topic models generate incoherent topics is that their objective functions do not always correlate well with human judgments (Chang et al. 2009). Specifically, these unsupervised models assume that the topic-word distribution follows a Dirichlet prior, which causes the words in each document to be treated as uncorrelated and generated independently (Ahmed et al. 2017). As a result, the topic modeling process ignores the lexical semantic information between words that is needed to learn meaningful, coherent topics in accordance with human cognition (Blei and Lafferty 2005). Knowledge-based topic models have been demonstrated to be an effective strategy for dealing with this problem (Petterson et al. 2010).

Several knowledge-based topic models have been proposed by researchers in recent years (Fu et al. 2018; Xu et al. 2018). Early knowledge-based models ask users to provide prior domain knowledge to extract more coherent topics (Andrzejewski et al. 2009). In these models, users are requested to provide two kinds of prior lexical semantic knowledge in the form of must-links and cannot-links (Jagarlamudi et al. 2012). A must-link means that two words should belong to the same topic, and a cannot-link means that two words should not be in the same topic. Compared to popular unsupervised topic models, incorporating these two forms of user-generated word–word correlation knowledge can improve modeling performance to some extent (Hu et al. 2014). However, the key weakness of these early knowledge-based models is that they require users to provide prior domain knowledge. The quantity and quality of manually generated prior knowledge are limited, as a user may have no idea what knowledge to provide, and the knowledge generation process is non-automatic and inefficient. In addition, early knowledge-based models lack a mechanism for judging knowledge applicability and incorporate word–word correlation in a hard, topic-independent way, which ignores the fact that whether two words are correlated depends on which topic they appear in (Xie et al. 2015).

To solve the problems of knowledge acquisition and knowledge judgment in early studies, researchers have explored mining word–word correlation knowledge automatically and detecting incorrect knowledge during the modeling process based on statistical word co-occurrence information in the corpus (Chen and Liu 2014a, b). These models solve the problems of early knowledge-based topic models and further improve the coherence of topic modeling results (Shams and Baraani-Dastjerdi 2017). However, their limitations are obvious: they need a large number of related domain datasets to mine valuable prior knowledge, which is often not available in practical applications. Given that semantically related words may not co-occur frequently in a text corpus, knowledge mining based on word co-occurrence information generates limited knowledge, which further increases the dependence on massive amounts of data. In addition, the evaluation mechanism of knowledge correctness in these models depends only on word co-occurrence information and cannot capture lexical semantic relations explicitly.

With the development of neural language models, word embeddings make it feasible and practical to represent words in a semantic vector space that retains the semantic and syntactic relations between words (Mikolov et al. 2013a). This property of word embeddings provides an effective way to mine lexical semantic knowledge by simply calculating the similarity between words in the continuous embedding space. Recently, more attention has been paid to incorporating word embeddings into topic models to produce more meaningful topics (Qiang et al. 2017; Xie et al. 2015). However, existing models mainly incorporate must-links based on word embeddings and ignore the incorporation of cannot-links in topic modeling (Chen and Liu 2014b; Xie et al. 2015; Yao et al. 2017). A cannot-link is a completely different form of word–word correlation from a must-link and has a completely different effect on topic assignment during topic modeling (Yang et al. 2015c). Incorporating and coordinating these two forms of word–word correlation derived from word embeddings simultaneously is critical to extracting more coherent topics, but little attention has been paid to this direction.

In this paper, we propose a novel topic model, called Mixed Word Correlation Knowledge-based Latent Dirichlet Allocation (MWCK-LDA), that combines must-links and cannot-links based on word embeddings. MWCK-LDA not only incorporates both must-links and cannot-links in a topic-dependent way but also effectively balances the effect of the two different knowledge forms on topic sampling. Compared to existing knowledge-based topic models that produce word–word correlation knowledge from users' experience or word co-occurrence information in the corpus, our model mines more abundant knowledge automatically from word embeddings, which is superior in both effectiveness and efficiency of knowledge generation. To incorporate and balance the two completely different forms of lexical semantic knowledge, drawing insights from Xie et al. (2015) and Zhu and Xing (2010), a Mixed Markov Random Field is constructed over the latent topic layer to regularize the topic assignment of each word during the topic modeling process, which gives the words in a must-link a better chance of being assigned to the same topic and the words in a cannot-link a better chance of being assigned to different topics. Compared to models that consider only must-links based on word vectors, our model also incorporates more abundant cannot-links in the modeling process simultaneously, which combines more comprehensive lexical semantic information and further improves topic quality. In addition to the improved knowledge incorporation capabilities, our model provides a soft mechanism to determine the applicability of must-links and cannot-links during the topic modeling process, which differs from the hard, topic-independent judgment mechanism of knowledge applicability in standard knowledge-based models. Specifically, the mechanism in our model does not directly specify which topic a must-link or cannot-link should or should not belong to, but leaves this to be determined automatically by the text documents. Moreover, our model balances the effect of the two forms of prior knowledge and the word co-occurrence information in the corpus on the topic assignment of each word in a document. The main contributions of this paper include:

  1. Proposing a novel knowledge-based topic model that coordinates two different forms of lexical semantic knowledge, must-links and cannot-links derived from word embeddings, in a soft and balanced manner.

  2. Providing a collapsed Gibbs sampling method to infer the posterior distribution and estimate the parameters of the proposed model, which incorporates both must-links and cannot-links adaptively.

  3. Demonstrating through extensive experiments that the proposed model outperforms state-of-the-art baseline models on two public benchmark datasets in terms of topic coherence under both qualitative and quantitative evaluation.

The remainder of this paper is organized as follows. Section 2 reviews related research. Section 3 presents our proposed topic model. Section 4 discusses the experiments, and finally, Sect. 5 concludes.

2 Related work

The research most closely related to our work is on knowledge-based topic models, which aim to incorporate external domain knowledge into the modeling process to improve topic quality. In this section, we focus on existing work on knowledge-based models and give a brief summary.

Several knowledge-based topic models have been proposed in recent years. For example, Andrzejewski et al. (2009) proposed a topic model using a Dirichlet forest to replace the Dirichlet as the prior over the topic-word multinomial distribution, which enables the model to encode sets of must-links and cannot-links. Chen et al. (2013b) proposed a topic model that exploits the knowledge form of s-sets derived from multiple past domains to extract more coherent and meaningful topics in a new domain. The s-set is an extended form of must-link, composed of multiple words that should belong to the same topic. In practice, the knowledge applied by the above models is produced by users. However, asking users to provide prior domain knowledge to guide the generative process of topic models can be problematic, as users may not know what to provide (Lee et al. 2017). It also makes the topic modeling process non-automatic. In addition, these models incorporate external domain knowledge into the modeling process in a primitive way and lack a mechanism for judging knowledge applicability, which limits their topic modeling performance.

To deal with the above problems, other researchers have explored mining prior knowledge automatically and eliminating incorrect knowledge based on word co-occurrence information in the corpus. Chen and Liu (2014b) proposed LTM, which extracts must-links by frequent itemset mining (FIM) from topics already inferred in past domains and uses pointwise mutual information (PMI) to automatically assess the applicability of the mined knowledge to the current topic modeling process. As an extension of LTM, Chen and Liu (2014a) developed a new model that expands the form of prior knowledge, incorporating both must-links and cannot-links into the generative process using the generalized Pólya urn (GPU) model. The generation and correctness measurement of knowledge in these models are both based on statistical word co-occurrence information in documents, which gives them superior performance compared to early knowledge-based topic models (Xu et al. 2018). However, the external knowledge incorporation strategy realized by the GPU model is essentially hard and topic-independent, as it simply assumes that correlated words should have probabilistically similar weights in each topic. Although these works introduce word co-occurrence information to assess the correctness of knowledge for the current modeling process, this evaluation mechanism cannot capture the semantic relations between words effectively, given that semantically related words may not co-occur frequently in a document corpus (Yao et al. 2016). In addition, the statistical knowledge mining method generates limited knowledge, so these models need a large number of past datasets to provide sufficient information for knowledge generation, which is not available in the majority of topic modeling tasks. Yang et al. (2015c) presented Sparse Constrained LDA (SC-LDA) to incorporate prior word–word correlation knowledge into LDA. In SC-LDA, must-links stem from synsets in WordNet 3.0, which restricts the scope of knowledge generation. For example, "deep" and "learning" are not synonymous but should constitute a must-link pair during topic modeling. In addition, SC-LDA is unable to balance the effects of must-links and cannot-links during the generative process of topics. SC-LDA also ignores the fact that whether two words are correlated depends on which topic they appear in and lacks a mechanism for judging knowledge applicability.

The development of word embeddings provides an effective strategy for representing words in a continuous vector space that retains abundant syntactic and semantic information between words (Yang et al. 2015a). Continuous vector representations make it feasible to measure lexical semantic similarity by the distance between words in the vector space (Mikolov et al. 2013a, b; Pennington et al. 2014). Recently, more attention has been paid to combining word embeddings with topic models to produce high-quality topics (Fang et al. 2016). Xie et al. (2015) proposed MRF-LDA, which uses must-links generated from word embeddings to improve topic modeling performance. Yao et al. (2017) developed WE-LDA to incorporate must-links derived from word vectors. However, existing works exploit only the must-link knowledge form derived from word embeddings and ignore the broader lexical semantic knowledge form of cannot-links. Both forms of word–word correlation are essential for understanding document content. Given that a cannot-link is a completely different knowledge form from a must-link and plays a completely different role during topic sampling, it is critical to explore strategies or mechanisms that coordinate different kinds of knowledge effectively in topic models. In addition, other researchers have explored applying word embeddings in topic modeling for short texts to alleviate their content sparsity problem (Xun et al. 2016; Yang et al. 2015b; Yao et al. 2016). These short-text topic models use only must-link knowledge. In this paper, however, we focus on improving topic modeling performance for long documents.

3 The proposed MWCK-LDA model

The proposed model is based on LDA. In this section, we start with a brief introduction to LDA and then describe MWCK-LDA in detail along with its inference method. The notations used in this paper and their meanings are summarized in Table 1. To infer the posterior probability and estimate the parameters of our model, we derive the Gibbs sampler and give the Gibbs sampling algorithm in Table 2.

Table 1 Notations used in this paper
Table 2 Algorithm of Gibbs sampling for MWCK-LDA

3.1 Brief review of LDA

In this section, we briefly review LDA, proposed by Blei et al. (2003). LDA is a generative probabilistic model that aims to infer latent topics from a document corpus. LDA follows two basic assumptions: (a) a document is assumed to be a multinomial distribution over topics, and (b) a topic is assumed to be a multinomial distribution over the words in the given vocabulary. The graphical model of LDA is shown in Fig. 1a.

Fig. 1 Graphical model of LDA (a) and MWCK-LDA (b)

Given a corpus containing M documents, the vocabulary derived from the corpus consists of V distinct words, and we assume the documents contain K latent topics. When generating the mth (\( m \in \left[ {1,M} \right] \)) document, LDA samples a document-topic multinomial distribution \( \vec{\theta }_{m} \) from a prior Dirichlet distribution with hyper-parameter \( \vec{\alpha } \): \( p\left( {\vec{\theta }_{m} |\vec{\alpha }} \right) = Dir(\vec{\theta }_{m} |\vec{\alpha }) \). \( \vec{\alpha } \) and \( \vec{\theta }_{m} \) are both K-dimensional vectors, and the elements of \( \vec{\theta }_{m} \) satisfy \( \mathop \sum \nolimits_{k} \vec{\theta }_{m,k} = 1 \), k = 1, …, K. LDA then assigns a latent topic \( z_{m,n} \) to each word \( w_{m,n} \) in document m based on the topic multinomial distribution \( \vec{\theta }_{m} \) of the mth document. Here n denotes the position of the word in the document and satisfies \( n \in \left[ {1,N_{m} } \right] \), where \( N_{m} \) is the number of words in document m. LDA assumes that all words in a document are independent of each other. For document m, the joint probability of the topic assignments of all words is given in Eq. (1), where \( \vec{z}_{m} \) denotes the topic assignments of all words in the document.

$$ p\left( {\vec{z}_{m} |\vec{\theta }_{m} } \right) = \mathop \prod \limits_{n = 1}^{{N_{m} }} p\left( {z_{m,n} |\vec{\theta }_{m} } \right) $$
(1)

As discussed above, the kth topic is assumed to be a multinomial distribution over the V words in the vocabulary. According to LDA, each topic-word multinomial distribution \( \vec{\varphi }_{k} \) follows a prior Dirichlet distribution with hyper-parameter \( \vec{\beta } \): \( p\left( {\vec{\varphi }_{k} |\vec{\beta }} \right) = Dir(\vec{\varphi }_{k} |\vec{\beta }) \), where both \( \vec{\varphi }_{k} \) and \( \vec{\beta } \) are V-dimensional vectors. \( \emptyset \) denotes a \( K \times V \) matrix containing all topics' multinomial distributions over words. \( \vec{\varphi }_{k,w} \) denotes the probability of generating word w given topic k and satisfies \( \mathop \sum \nolimits_{w} \vec{\varphi }_{k,w} = 1 \), where w = 1, …, V. Each word in document m is then generated by sampling from the multinomial distribution of its assigned latent topic: \( p\left( {w_{m,n} = w|z_{m,n} = k} \right) = \vec{\varphi }_{k,w} \). The generative process of LDA is as follows:

  1. For the kth topic, where k = 1, 2, …, K:

     Draw \( \vec{\varphi }_{k} \sim Dir\left( {\vec{\beta }} \right) \)

  2. For the mth document, where m = 1, 2, …, M:

     (a) Draw \( \vec{\theta }_{m} \sim Dir\left( {\vec{\alpha }} \right) \)

     (b) For the nth word in the mth document, where n = 1, 2, …, \( N_{m} \):

       • Draw \( z_{m,n} \sim Multi\left( {\vec{\theta }_{m} } \right) \)

       • Draw \( w_{m,n} \sim Multi\left( {\vec{\varphi }_{{z_{m,n} }} } \right) \)
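
As a concrete illustration, the following minimal sketch simulates this generative process for a toy corpus (assuming NumPy; the corpus sizes and the fixed document length are illustrative choices, not values from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 50, 10           # topics, vocabulary size, documents (toy values)
alpha, beta = 50.0 / K, 0.01  # symmetric Dirichlet hyper-parameters
N_m = 20                      # words per document (fixed here for simplicity)

# Step 1: draw a topic-word distribution phi_k ~ Dir(beta) for each topic
phi = rng.dirichlet(np.full(V, beta), size=K)   # shape (K, V)

docs = []
for m in range(M):
    # Step 2(a): draw the document-topic distribution theta_m ~ Dir(alpha)
    theta_m = rng.dirichlet(np.full(K, alpha))
    words = []
    for n in range(N_m):
        # Step 2(b): draw a topic, then draw a word from that topic
        z = rng.choice(K, p=theta_m)
        w = rng.choice(V, p=phi[z])
        words.append(w)
    docs.append(words)
```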

Given a document corpus, \( w_{m,n} \) is the observed variable, and \( \alpha \) and \( \beta \) are prior hyper-parameters. \( z_{m,n} \), \( \vec{\theta }_{m} \) and \( \vec{\varphi }_{{z_{m,n} }} \) are hidden variables that can be estimated from the observed words in the corpus. The joint distribution of all variables is as follows:

$$ p\left( {\vec{w}_{m} ,\vec{z}_{m} ,\vec{\theta }_{m} ,\emptyset |\vec{\alpha },\vec{\beta }} \right) = \mathop \prod \limits_{n = 1}^{{N_{m} }} p\left( {w_{m,n} |\vec{\varphi }_{{z_{m,n} }} } \right) \cdot p\left( {z_{m,n} |\vec{\theta }_{m} } \right) \cdot p\left( {\vec{\theta }_{m} |\vec{\alpha }} \right) \cdot p\left( {\emptyset |\vec{\beta }} \right) $$
(2)

The hidden variables in the generative process can be approximated by Markov chain Monte Carlo (MCMC) inference, implemented here with Gibbs sampling. Parameter inference by Gibbs sampling in LDA was first proposed by Griffiths and Steyvers (2004) and has been widely used for parameter estimation in topic models. To simplify inference, we assume the prior Dirichlet distributions in the topic model are symmetric. Under these assumptions, we can derive the conditional distribution for sampling a topic z for each word in the corpus as follows:

$$ {\text{p}}\left( {z_{i} = k|\vec{z}_{\neg i} ,\vec{w}} \right) \propto \frac{{n_{m,\neg i}^{\left( k \right)} + \alpha }}{{\mathop \sum \nolimits_{k = 1}^{K} \left( {n_{m,\neg i}^{\left( k \right)} + \alpha } \right)}} \cdot \frac{{n_{k,\neg i}^{{\left( {w_{i} } \right)}} + \beta }}{{\mathop \sum \nolimits_{i = 1}^{V} \left( {n_{k,\neg i}^{{\left( {w_{i} } \right)}} + \beta } \right)}} $$
(3)

where \( n_{m,\neg i}^{\left( k \right)} \) is the number of words in document m assigned to topic k and \( n_{k,\neg i}^{{\left( {w_{i} } \right)}} \) is the number of times word \( w_{i} \) is assigned to topic k. The symbol \( \neg i \) denotes that the ith word is excluded from the counts. \( \alpha \) and \( \beta \) are the hyper-parameters of the two symmetric prior Dirichlet distributions, respectively.
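
For reference, a minimal sketch of one collapsed Gibbs sampling sweep implementing Eq. (3) is shown below (assuming NumPy; `docs` is a list of word-index lists, `z` holds the current topic assignments, and the count arrays follow the notation above; the function and variable names are our own, not from the paper).

```python
import numpy as np

def gibbs_sweep_lda(docs, z, n_mk, n_kw, n_k, alpha, beta, K, V, rng):
    """One sweep of collapsed Gibbs sampling for LDA, following Eq. (3)."""
    for m, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[m][i]
            # Exclude the current word from all counts (the "not i" terms)
            n_mk[m, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1

            # p(z_i = k | z_-i, w) is proportional to
            # (n_mk + alpha) * (n_kw + beta) / (n_k + V * beta)
            p = (n_mk[m] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=p / p.sum())

            # Record the new topic and restore the counts
            z[m][i] = k_new
            n_mk[m, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
```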

3.2 MWCK-LDA

In this section, we discuss how to incorporate lexical semantic correlation into the learning process of the topic model and present our proposed model, MWCK-LDA. As discussed above, the standard LDA model assumes that the topic-word multinomial distribution follows a prior Dirichlet distribution. Under this assumption, each document is treated as a bag of words in which every word is independent of the others, which ignores the effect of semantic correlation between words on the topic assignment of each word. Although this simplifying assumption improves the modeling efficiency of LDA, it causes the model to generate incoherent topics in practice. Incorporating prior lexical semantic correlation into the topic model is therefore of great significance for improving the quality of the learned topics.

Lexical semantic correlations take two forms: must-links and cannot-links. Must-links denote positive semantic correlation between words, which should belong to the same topic during topic modeling. Cannot-links denote negative semantic correlation between words, which should belong to different topics. For example, "game" and "team" are likely to belong to the same topic, while "team" and "apple" should belong to different topics. It is critical to use and balance both must-links and cannot-links when inferring topics. Word embeddings embed the syntactic and semantic information of words into a continuous vector space, representing each discrete word token as a vector such that semantically related words lie close to each other in the space. This vector representation makes it possible to model the semantic relationship between words simply by measuring the similarity between their vectors. In this paper, we apply cosine similarity to measure the correlation between words based on two thresholds \( \mu_{1} \) and \( \mu_{2} \). If the cosine similarity between two word vectors is higher than \( \mu_{1} \), the two words are considered to form a must-link pair, indicating a positive semantic correlation between them. Conversely, if the cosine similarity between two word vectors is lower than \( \mu_{2} \), the two words are considered to form a cannot-link pair, indicating a negative semantic correlation between them.
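
The following sketch illustrates this labeling step (assuming NumPy and a dictionary `embeddings` mapping each vocabulary word to its vector; the function name and the pair-enumeration strategy are our own illustrative choices).

```python
import numpy as np
from itertools import combinations

def label_word_pairs(vocab, embeddings, mu1, mu2):
    """Label word pairs as must-links or cannot-links by cosine similarity."""
    must_links, cannot_links = set(), set()
    for w1, w2 in combinations(vocab, 2):
        v1, v2 = embeddings[w1], embeddings[w2]
        sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        pair = tuple(sorted((w1, w2)))
        if sim > mu1:        # highly similar -> positive correlation
            must_links.add(pair)
        elif sim < mu2:      # dissimilar -> negative correlation
            cannot_links.add(pair)
    return must_links, cannot_links
```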

The key idea of MWCK-LDA is that if two words in a document form a must-link, they are more likely to belong to the same topic, whereas if they form a cannot-link, they are more likely to belong to different topics. The graphical model of MWCK-LDA is depicted in Fig. 1b. In MWCK-LDA, we define a mechanism that coordinates must-links and cannot-links during the topic modeling process by imposing a Mixed Markov Random Field over the latent topic layer. The Mixed Markov Random Field consists of two kinds of Markov Random Field (MRF), which incorporate the two forms of lexical correlation knowledge, respectively: a semantically positively correlated MRF and a semantically negatively correlated MRF. With this mechanism, the objective function of our model becomes more consistent with human judgments than those of existing models, enabling it to produce more coherent topics.

Given a document m containing \( N_{m} \) words \( \left\{ {w_{m,n} } \right\}_{n = 1}^{{N_{m} }} \), we traverse all word pairs and calculate the cosine similarity between each pair based on their word embedding representations. If the similarity of a word pair (\( w_{m,i} ,w_{m,j} \)) is higher than the threshold \( \mu_{1} \), the two words constitute a must-link, and the semantically positively correlated MRF defines a positive semantic undirected edge between their topic assignments (\( z_{m,i} ,z_{m,j} \)), represented by a solid undirected line in Fig. 1b. Conversely, if the similarity of a word pair (\( w_{m,i} ,w_{m,j} \)) is lower than the threshold \( \mu_{2} \), the two words constitute a cannot-link, and the semantically negatively correlated MRF defines a negative semantic undirected edge between their topic assignments (\( z_{m,i} ,z_{m,j} \)), represented by a dotted undirected line in Fig. 1b. After all word pairs in the mth document have been traversed, the Mixed Markov Random Field yields two undirected graphs, \( G_{{P_{m} }} \) and \( G_{{N_{m} }} \). In \( G_{{P_{m} }} \), the latent topic assignments are nodes and the connections between the topic assignments of must-link word pairs are edges; the set of all edges in \( G_{{P_{m} }} \) is denoted by \( P_{m} \). Analogously, in \( G_{{N_{m} }} \) the nodes are latent topic assignments and the edges connect the topic assignments of cannot-link word pairs; its edge set is denoted by \( N_{m} \). In Fig. 1b, \( P_{m} = \left\{ {\left( {z_{m,1} ,z_{m,3} } \right), \left( {z_{m,3} ,z_{m,n} } \right), \ldots } \right\} \) and \( N_{m} = \left\{ {\left( {z_{m,2} ,z_{m,3} } \right), \left( {z_{m,2} ,z_{m,n} } \right), \ldots } \right\} \).
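
A sketch of this per-document edge construction is shown below (it reuses the `must_links` and `cannot_links` sets labeled in the previous sketch; `doc_words` is the word list of one document, and the returned index-pair lists correspond to \( P_{m} \) and \( N_{m} \)).

```python
from itertools import combinations

def build_mrf_edges(doc_words, must_links, cannot_links):
    """Build the positive (P_m) and negative (N_m) edge sets over the
    topic assignments of a single document."""
    P_m, N_m = [], []
    for i, j in combinations(range(len(doc_words)), 2):
        pair = tuple(sorted((doc_words[i], doc_words[j])))
        if pair in must_links:
            P_m.append((i, j))     # solid edge between z_{m,i} and z_{m,j}
        elif pair in cannot_links:
            N_m.append((i, j))     # dotted edge between z_{m,i} and z_{m,j}
    return P_m, N_m
```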

To encode the semantic correlation between words, for each edge in \( P_{m} \) and \( N_{m} \) the Mixed Markov Random Field defines a binary potential so that the words in a must-link tend to be assigned to the same topic, while the words in a cannot-link tend not to be. Specifically, to encode must-link correlation, for each edge (\( z_{m,i} ,z_{m,j} \)) in \( P_{m} \) we define a binary potential \( \exp \left\{ {I\left( {z_{m,i} = z_{m,j} } \right)} \right\} \), where \( I\left( \cdot \right) \) is the indicator function. If the two latent topic assignments are the same, this potential yields a large value; otherwise, it yields a small value. As a result, the potential increases the probability of assigning the words in a must-link to the same topic. To encode cannot-link correlation, for each edge (\( z_{m,f} ,z_{m,g} \)) in \( N_{m} \) we define another binary potential \( \exp \left\{ {I\left( {z_{m,f} \ne z_{m,g} } \right)} \right\} \). If the two topics are different, this potential yields a large value, which encourages the two words in a cannot-link to be assigned to different topics. In our model, the joint probability of all topic assignments \( \left\{ {z_{m,n} } \right\}_{n = 1}^{{N_{m} }} \) in document m can be written as

$$ p\left( {\vec{z}_{m} |\vec{\theta }_{m} ,\lambda_{1} ,\lambda_{2} } \right) = \mathop \prod \limits_{n = 1}^{{N_{m} }} p\left( {z_{m,n} |\vec{\theta }_{m} } \right)\exp \left\{ {\lambda_{1} \frac{{\mathop \sum \nolimits_{{\left( {z_{m,i} ,z_{m,j} } \right) \in P_{m} }} I\left( {z_{m,i} = z_{m,j} } \right)}}{{\left| {P_{m} } \right|}}} \right\} \exp \left\{ {\lambda_{2} \frac{{\mathop \sum \nolimits_{{\left( {z_{m,f} ,z_{m,g} } \right) \in N_{m} }} I\left( {z_{m,f} \ne z_{m,g} } \right)}}{{\left| {N_{m} } \right|}}} \right\} $$
(4)

where \( \left| {P_{m} } \right| \) is the number of edges in \( P_{m} \) and \( \left| {N_{m} } \right| \) is the number of edges in \( N_{m} \). \( \lambda_{1} \) and \( \lambda_{2} \) are two hyper-parameters set by users to balance the effect of the external knowledge against the statistical word co-occurrence information contained in the corpus. From the joint probability of all topic assignments in document m, the sampling of the latent topic of each word reflects the joint effect of the multinomial topic distribution \( \vec{\theta }_{m} \) and the word correlation knowledge in the current document. The generative process of MWCK-LDA is as follows.

  1. For the kth topic, where k = 1, 2, …, K:

     Draw \( \vec{\varphi }_{k} \sim Dir\left( {\vec{\beta }} \right) \)

  2. For the mth document, where m = 1, 2, …, M:

     (a) Draw \( \vec{\theta }_{m} \sim Dir\left( {\vec{\alpha }} \right) \)

     (b) For the nth token in the mth document, where n = 1, 2, …, \( N_{m} \):

       • Draw \( z_{m,n} \sim p\left( {\vec{z}_{m} |\vec{\theta }_{m} ,\lambda_{1} ,\lambda_{2} } \right) \)

       • Draw \( w_{m,n} \sim Multi\left( {\vec{\varphi }_{{z_{m,n} }} } \right) \)

In this paper, collapsed Gibbs sampling is used to estimate the parameters of the proposed model. Collapsed Gibbs sampling has been widely used in many topic models, and the derivation of the final Gibbs sampler for MWCK-LDA is discussed in detail in Sect. 3.3.

The derived approximate Gibbs sampler has the following conditional distribution.

$$ p\left( {z_{i} = k|\vec{z}_{\neg i} ,\vec{w}} \right) \propto \frac{{n_{m,\neg i}^{\left( k \right)} + \alpha }}{{\mathop \sum \nolimits_{k = 1}^{K} \left( {n_{m,\neg i}^{\left( k \right)} + \alpha } \right)}} \cdot \frac{{n_{k,\neg i}^{{\left( {w_{i} } \right)}} + \beta }}{{\mathop \sum \nolimits_{i = 1}^{V} \left( {n_{k,\neg i}^{{\left( {w_{i} } \right)}} + \beta } \right)}} \cdot \exp \left( {\lambda_{1} \frac{{\mathop \sum \nolimits_{{j \in ML_{m,i} }} \left( {z_{j} = k} \right)}}{{\left| {ML_{m,i} } \right|}}} \right) \cdot \exp \left( {\lambda_{2} \frac{{\mathop \sum \nolimits_{{j \in CL_{m,i} }} \left( {z_{j} \ne k} \right)}}{{\left| {CL_{m,i} } \right|}}} \right) $$
(5)

where \( n_{m,\neg i}^{\left( k \right)} \) is the number of words in document m assigned to topic k, \( n_{k,\neg i}^{{\left( {w_{i} } \right)}} \) is the number of times word \( w_{i} \) is assigned to topic k, and \( \neg i \) denotes that word \( w_{i} \) is excluded from the counts. \( ML_{m,i} \) denotes the set of words in the mth document labeled as forming must-links with \( w_{i} \), and \( \left| {ML_{m,i} } \right| \) is the number of words in \( ML_{m,i} \). \( CL_{m,i} \) denotes the set of words in the mth document labeled as forming cannot-links with \( w_{i} \), and \( \left| {CL_{m,i} } \right| \) is the number of words in \( CL_{m,i} \). \( \alpha \) and \( \beta \) are predefined Dirichlet hyper-parameters, and K is the predefined number of latent topics. The details of the Gibbs sampling process of MWCK-LDA are described in Algorithm 1. At the beginning, we randomly assign an initial topic to each word in the corpus and count \( n_{m}^{\left( k \right)} \), \( n_{k}^{{\left( {w_{m,n} } \right)}} \) and \( n_{k} \). Given the number of Gibbs sampling iterations I, in each iteration a new latent topic is reassigned to each word in the corpus according to the derived approximate Gibbs sampler, and the above counts are updated.
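
A minimal sketch of the sampling step in Eq. (5) is shown below (assuming NumPy; `ml_sets[m][i]` and `cl_sets[m][i]` hold the positions of the words forming must-links and cannot-links with word i in document m, which can be precomputed from the edge sets \( P_{m} \) and \( N_{m} \); the names are our own, not from the paper).

```python
import numpy as np

def sample_topic_mwck(m, i, w, z, n_mk, n_kw, n_k,
                      ml_sets, cl_sets, alpha, beta, lam1, lam2, K, V, rng):
    """Sample a new topic for word i of document m following Eq. (5)."""
    k_old = z[m][i]
    n_mk[m, k_old] -= 1
    n_kw[k_old, w] -= 1
    n_k[k_old] -= 1

    # Standard LDA factors (document-topic and topic-word counts)
    p = (n_mk[m] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)

    # Must-link factor: favor topics shared with must-linked neighbors
    ml = ml_sets[m][i]
    if ml:
        same = np.bincount([z[m][j] for j in ml], minlength=K)
        p *= np.exp(lam1 * same / len(ml))

    # Cannot-link factor: favor topics NOT used by cannot-linked neighbors
    cl = cl_sets[m][i]
    if cl:
        same = np.bincount([z[m][j] for j in cl], minlength=K)
        p *= np.exp(lam2 * (len(cl) - same) / len(cl))

    k_new = rng.choice(K, p=p / p.sum())
    z[m][i] = k_new
    n_mk[m, k_new] += 1
    n_kw[k_new, w] += 1
    n_k[k_new] += 1
    return k_new
```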

As discussed in Sect. 3.1, the standard LDA model draws the discrete document-topic and topic-word distributions from Dirichlet priors. This choice is motivated mainly by learning efficiency, due to the conjugacy between the Dirichlet and multinomial distributions. However, using Dirichlet priors ignores the correlations between tokens in the documents, so the tokens of a document are generated independently. In addition, modeling word co-occurrence patterns from a text corpus is hard due to the heavy-tailed nature of the vocabulary. Equation (3) is the sampler used to infer the hidden variables in the standard topic model. From this sampler, we can observe that the standard topic model mostly focuses on modeling irrelevant but common word co-occurrence patterns in the corpus, which cannot guide the model toward the intended state of the latent variables. In contrast, if the distribution of the latent variables captures lexical semantic correlation, it will be more in accordance with human judgments. Based on this analysis, it is critical to retain the desired constraints and bias the latent distribution toward the intended state. Furthermore, a document corpus contains abundant lexical semantic correlations, such as must-links and cannot-links, which are in accordance with human cognition. In MWCK-LDA, to encode the prior lexical correlations of must-links and cannot-links, the Mixed Markov Random Field is constructed over the latent topic layer of each document. Equation (4) gives the joint probability of each document m under our method. From this likelihood of all topic assignments in document m, we can observe that the sampling of the latent topic of each word is not independent but retains the desired lexical correlation knowledge in the current document. The sampler of our model is given in Eq. (5). Unlike the standard topic model, we use word–word correlation knowledge to influence the posterior distribution, biasing the model to allocate similar words to the same topic and dissimilar words to different topics. By imposing this regularization on the posterior distribution of the latent variables during topic modeling, we ensure that the desired constraint information is retained in the learned model. As a result, the knowledge incorporation mechanism of our method biases the model toward greater accordance with human judgments.

3.3 Gibbs sampling method for MWCK-LDA

In this section, we discuss how to derive the conditional distribution \( p\left( {z_{i} = k|\vec{z}_{\neg i} ,\vec{w}} \right) \) of the approximate Gibbs sampler used in MWCK-LDA. According to the definition of conditional probability, \( p\left( {z_{i} = k|\vec{z}_{\neg i} ,\vec{w}} \right) \) can be obtained as follows:

$$ p\left( {z_{i} = k|\vec{z}_{\neg i} ,\vec{w}} \right) = \frac{{p\left( {\vec{w},\vec{z}|\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right)}}{{p\left( {\vec{w},\vec{z}_{\neg i} |\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right)}} \propto \frac{{p\left( {\vec{w},\vec{z}|\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right)}}{{p\left( {\vec{w}_{\neg i} ,\vec{z}_{\neg i} |\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right)}} . $$
(6)

From Eq. (6), to derive the conditional distribution we first need to derive the joint distribution \( p\left( {\vec{w},\vec{z}|\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right) \). According to the graphical model of MWCK-LDA in Fig. 1b, \( p\left( {\vec{w},\vec{z}|\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right) \) factorizes as follows:

$$ p\left( {\vec{w},\vec{z}|\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right) = p\left( {\vec{w}|\vec{z},\beta } \right)p\left( {\vec{z}|\alpha ,\lambda_{1} ,\lambda_{2} } \right) . $$
(7)

As given in Eq. (7), deriving the distribution \( p\left( {\vec{w},\vec{z}|\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right) \) reduces to deriving \( p\left( {\vec{w}|\vec{z},\beta } \right) \) and \( p\left( {\vec{z}|\alpha ,\lambda_{1} ,\lambda_{2} } \right) \). For \( p\left( {\vec{w}|\vec{z},\beta } \right) \), the derivation in MWCK-LDA is the same as in the LDA model. We obtain \( p\left( {\vec{w}|\vec{z},\beta } \right) \) by integrating over \( \emptyset \): \( p\left( {\vec{w}|\vec{z},\beta } \right) = \smallint p\left( {\vec{w}|\vec{z},\emptyset } \right)p\left( {\emptyset |\vec{\beta }} \right){\text{d}}\emptyset \). According to the assumptions of the topic model, \( p\left( {\emptyset |\vec{\beta }} \right) \) is a Dirichlet distribution and \( p\left( {\vec{w}|\vec{z},\emptyset } \right) \) is a multinomial distribution. Because of the conjugacy between the Dirichlet and multinomial distributions, \( p\left( {\vec{w}|\vec{z},\beta } \right) \) can be obtained as in Eq. (8).

$$ p\left( {\vec{w}|\vec{z},\beta } \right) = \smallint p\left( {\vec{w}|\vec{z},\emptyset } \right)p\left( {\emptyset |\vec{\beta }} \right){\text{d}}\emptyset = \mathop \prod \limits_{k = 1}^{K} \frac{{\Delta \left( {\vec{n}_{k} + \vec{\beta }} \right)}}{{\Delta \left( {\vec{\beta }} \right)}} $$
(8)

where \( \vec{n}_{k} = \left\{ {n_{k}^{{\left( {w_{i} } \right)}} } \right\}_{i = 1}^{V} \) and \( n_{k}^{{\left( {w_{i} } \right)}} \) is the number of times word \( w_{i} \) is assigned to topic k. Here, we use the \( \Delta \) function defined by Heinrich (2005), given in Eq. (9).

$$ \Delta \left( {\vec{\beta }} \right) = \frac{{\mathop \prod \nolimits_{t = 1}^{V}\Gamma \left( \beta \right)}}{{\Gamma \left( {\mathop \sum \nolimits_{t = 1}^{V} \beta } \right)}} $$
(9)

Based on the definition of \( \Delta \) function, for \( \Delta \left( {\vec{n}_{k} + \vec{\beta }} \right) \) we have

$$ \Delta \left( {\vec{n}_{k} + \vec{\beta }} \right) = \frac{{\mathop \prod \nolimits_{i = 1}^{V} \varGamma \left( {n_{k}^{{\left( {w_{i} } \right)}} + \beta } \right)}}{{\varGamma \left( {\mathop \sum \nolimits_{i = 1}^{V} \left( {n_{k}^{{\left( {w_{i} } \right)}} + \beta } \right)} \right)}} = \frac{{\mathop \prod \nolimits_{i = 1}^{V} \varGamma \left( {n_{k}^{{\left( {w_{i} } \right)}} + \beta } \right)}}{{\varGamma \left( {n_{k} + V\beta } \right)}} $$
(10)

where \( n_{k} \) denotes the total number of words assigned to topic k, \( n_{k} = \mathop \sum \nolimits_{i = 1}^{V} n_{k}^{{\left( {w_{i} } \right)}} \).

Then we investigate how to derive \( p\left( {\vec{z}|\alpha ,\lambda_{1} ,\lambda_{2} } \right) \). Similar to \( p\left( {\vec{w}|\vec{z},\beta } \right) \), we obtain \( p\left( {\vec{z}|\alpha ,\lambda_{1} ,\lambda_{2} } \right) \) by integrating over \( \theta \): \( p\left( {\vec{z}|\alpha ,\lambda_{1} ,\lambda_{2} } \right) = \smallint p\left( {\vec{z}|\theta ,\lambda_{1} ,\lambda_{2} } \right)p\left( {\theta |\alpha } \right){\text{d}}\theta \). From Eq. (4), we get

$$ p\left( {\vec{z}|\alpha ,\lambda_{1} ,\lambda_{2} } \right) = \smallint p\left( {\vec{z}|\theta } \right)\exp \left\{ {\lambda_{1} \frac{{\mathop \sum \nolimits_{{\left( {z_{m,i} ,z_{m,j} } \right) \in P_{m} }} I\left( {z_{m,i} = z_{m,j} } \right)}}{{\left| {P_{m} } \right|}}} \right\} \exp \left\{ {\lambda_{2} \frac{{\mathop \sum \nolimits_{{\left( {z_{m,f} ,z_{m,g} } \right) \in N_{m} }} I\left( {z_{m,f} \ne z_{m,g} } \right)}}{{\left| {N_{m} } \right|}}} \right\}p\left( {\theta |\alpha } \right){\text{d}}\theta $$
(11)

where \( p\left( {\vec{z}|\theta } \right) \) is a multinomial distribution and \( p\left( {\theta |\alpha } \right) \) follows a Dirichlet distribution. The two binary potentials, \( \exp \left\{ {\lambda_{1} \frac{{\mathop \sum \nolimits_{{\left( {z_{m,i} ,z_{m,j} } \right) \in P_{m} }} I\left( {z_{m,i} = z_{m,j} } \right)}}{{\left| {P_{m} } \right|}}} \right\} \) and \( \exp \left\{ {\lambda_{2} \frac{{\mathop \sum \nolimits_{{\left( {z_{m,f} ,z_{m,g} } \right) \in N_{m} }} I\left( {z_{m,f} \ne z_{m,g} } \right)}}{{\left| {N_{m} } \right|}}} \right\} \), do not depend on the variable \( \theta \). As with \( p\left( {\vec{w}|\vec{z},\beta } \right) \), taking the conjugacy between \( p\left( {\vec{z}|\theta } \right) \) and \( p\left( {\theta |\alpha } \right) \) into consideration, we obtain

$$ p\left( {\vec{z}|\alpha ,\lambda_{1} ,\lambda_{2} } \right) = \mathop \prod \limits_{m = 1}^{M} \frac{{\Delta \left( {\vec{n}_{m} + \vec{\alpha }} \right)}}{{\Delta \left( {\vec{\alpha }} \right)}} \exp \left\{ {\lambda_{1} \frac{{\mathop \sum \nolimits_{{\left( {z_{m,i} ,z_{m,j} } \right) \in P_{m} }} I\left( {z_{m,i} = z_{m,j} } \right)}}{{\left| {P_{m} } \right|}}} \right\} \exp \left\{ {\lambda_{2} \frac{{\mathop \sum \nolimits_{{\left( {z_{m,f} ,z_{m,g} } \right) \in N_{m} }} I\left( {z_{m,f} \ne z_{m,g} } \right)}}{{\left| {N_{m} } \right|}}} \right\}. $$
(12)

Having obtained \( p\left( {\vec{w}|\vec{z},\beta } \right) \) and \( p\left( {\vec{z}|\alpha ,\lambda_{1} ,\lambda_{2} } \right) \), the conditional distribution \( p\left( {z_{i} = k|\vec{z}_{\neg i} ,\vec{w}} \right) \) of the Gibbs sampler can be derived by combining Eqs. (6) and (7) as follows:

$$ \begin{aligned} & p\left( {z_{i} = k|\vec{z}_{\neg i} ,\vec{w}} \right) = \frac{{p\left( {\vec{w},\vec{z}|\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right)}}{{p\left( {\vec{w},\vec{z}_{\neg i} |\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right)}} \propto \frac{{p\left( {\vec{w},\vec{z}|\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right)}}{{p\left( {\vec{w}_{\neg i} ,\vec{z}_{\neg i} |\alpha ,\beta ,\lambda_{1} ,\lambda_{2} } \right)}} \\ & \propto \frac{{\Delta \left( {\vec{n}_{k} + \vec{\beta }} \right)}}{{\Delta \left( {\vec{n}_{k,\neg i} + \vec{\beta }} \right)}} \cdot \frac{{\Delta \left( {\vec{n}_{m} + \vec{\alpha }} \right)}}{{\Delta \left( {\vec{n}_{m,\neg i} + \vec{\alpha }} \right)}} \cdot \exp \left( {\lambda_{1} \frac{{\mathop \sum \nolimits_{{j \in ML_{m,i} }} \left( {z_{j} = k} \right)}}{{\left| {ML_{m,i} } \right|}}} \right) \cdot \exp \left( {\lambda_{2} \frac{{\mathop \sum \nolimits_{{j \in CL_{m,i} }} \left( {z_{j} \ne k} \right)}}{{\left| {CL_{m,i} } \right|}}} \right) \\ & \propto \frac{{n_{m,\neg i}^{\left( k \right)} + \alpha }}{{\mathop \sum \nolimits_{k = 1}^{K} \left( {n_{m,\neg i}^{\left( k \right)} + \alpha } \right)}} \cdot \frac{{n_{k,\neg i}^{{\left( {w_{i} } \right)}} + \beta }}{{\mathop \sum \nolimits_{i = 1}^{V} \left( {n_{k,\neg i}^{{\left( {w_{i} } \right)}} + \beta } \right)}} \cdot \exp \left( {\lambda_{1} \frac{{\mathop \sum \nolimits_{{j \in ML_{m,i} }} \left( {z_{j} = k} \right)}}{{\left| {ML_{m,i} } \right|}}} \right) \cdot \exp \left( {\lambda_{2} \frac{{\mathop \sum \nolimits_{{j \in CL_{m,i} }} \left( {z_{j} \ne k} \right)}}{{\left| {CL_{m,i} } \right|}}} \right). \\ \end{aligned} $$
(13)

After Gibbs sampling finishes, we obtain the latent topic assignment of each word in the corpus and then estimate the document-topic distribution \( \vec{\theta }_{m} \) and the topic-word distribution \( \vec{\varphi }_{k} \) as follows:

$$ \theta_{m}^{\left( k \right)} = \frac{{n_{m}^{\left( k \right)} + \alpha }}{{\mathop \sum \nolimits_{k = 1}^{K} \left( {n_{m}^{\left( k \right)} + \alpha } \right)}} $$
(14)
$$ \varphi_{k}^{{\left( {w_{i} } \right)}} = \frac{{n_{k}^{{\left( {w_{i} } \right)}} + \beta }}{{\mathop \sum \nolimits_{i = 1}^{V} \left( {n_{k}^{{\left( {w_{i} } \right)}} + \beta } \right)}} . $$
(15)
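
A sketch of these two estimates computed from the final count matrices is shown below (assuming NumPy arrays `n_mk` of shape (M, K) and `n_kw` of shape (K, V), as maintained by the sampling sketches above).

```python
import numpy as np

def estimate_parameters(n_mk, n_kw, alpha, beta):
    """Estimate theta (document-topic) and phi (topic-word) from the
    final counts, following Eqs. (14) and (15)."""
    theta = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```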

4 Experiment

This section evaluates the proposed model and compares it with the following five state-of-the-art baseline topic models on two benchmark datasets.

  • LDA: LDA is a classic unsupervised topic model that has been widely used in many tasks (Blei et al. 2003). Many knowledge-based topic models are extensions of LDA.

  • GK-LDA: GK-LDA is a knowledge-based topic model that exploits lexical correlation knowledge from a dictionary (Chen et al. 2013a). In addition, GK-LDA has a mechanism for handling wrong knowledge based on the ratio between word probabilities under each topic. However, GK-LDA cannot mine prior knowledge automatically, so we feed it the knowledge produced by the knowledge mining method proposed in this paper, which allows us to assess the knowledge fusion capability of each model.

  • MRF-LDA: A knowledge-based topic model that uses external word correlation knowledge generated from word embeddings (Xie et al. 2015). However, MRF-LDA incorporates only must-links and ignores other forms of word–word correlation.

  • SC-LDA: A knowledge-based topic model that incorporates prior word correlation knowledge based on WordNet 3.0 and word embeddings (Yang et al. 2015c). Given that extracting word correlation knowledge from WordNet 3.0 synsets may restrict the quantity of generated knowledge, we feed the knowledge generated for MWCK-LDA into SC-LDA to compare their knowledge incorporation mechanisms.

  • WE-LDA: WE-LDA is a knowledge-based topic model that uses must-links to improve topic modeling performance (Yao et al. 2017). Similar to MRF-LDA, WE-LDA cannot incorporate other forms of knowledge.

4.1 Experimental settings

  • Datasets We use two public benchmark datasets, which have been used to validate the performance of many topic models. The first dataset is 20-Newsgroups. The 20-Newsgroups dataset has been one of the most commonly used benchmarks in natural language processing, including topic modeling, text classification and text clustering. It contains about 20,000 newsgroup documents divided into 20 categories. The second dataset is NIPS, collected from the Annual Conference on Neural Information Processing Systems (NIPS) proceedings published from 1987 to 1999; it contains about 1500 documents. For 20-Newsgroups, we perform the following preprocessing: (1) convert letters to lowercase and (2) remove stop words using the NLTK stop word list (a minimal sketch of this preprocessing is given after this list). For NIPS, we use the original documents for topic modeling.

  • Word embeddings In this paper, Web Eigenwords is selected as the source for mining the word–word correlations of must-links and cannot-links. In Web Eigenwords, each word is represented by a real-valued vector. To generate word correlation knowledge, we calculate the cosine similarity between the vectors of two words.

  • Parameter settings For all models, posterior estimates of the latent variables are obtained with 2000 iterations of Gibbs sampling. For both benchmark datasets, we set the number of latent topics to 100. The other parameters of the baseline models are set according to their original papers. For LDA, \( \alpha \) and \( \beta \) are set to \( 50/K \) and 0.01. For GK-LDA, we set \( \alpha = 1 \) and \( \beta = 0.1 \). For MRF-LDA, we set \( \alpha \) and \( \beta \) to \( 50/K \) and 0.01; to incorporate must-links from Web Eigenwords, we set the threshold \( \mu_{1} \) to 0.99 and \( \lambda_{1} \) to 1. For SC-LDA, \( \alpha \) and \( \beta \) are set to 1.0 and 0.01. For WE-LDA, the parameters are set as suggested in the original paper. For MWCK-LDA, we set \( \alpha \) and \( \beta \) to \( 50/K \) and 0.01, and \( \lambda_{1} \) and \( \lambda_{2} \) are both set to 1 to balance the effect of must-links and cannot-links. To generate the word–word correlation knowledge used in this paper, the two thresholds \( \mu_{1} \) and \( \mu_{2} \) are set to 0.99 and 0.1: word pairs with similarity higher than \( \mu_{1} \) are labeled as must-links, and word pairs with similarity lower than \( \mu_{2} \) are labeled as cannot-links.
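
As referenced in the Datasets item above, a minimal sketch of the 20-Newsgroups preprocessing is shown below (assuming NLTK's English stop word list has been downloaded; the regular-expression tokenization is our own assumption, as the paper does not specify a tokenizer).

```python
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))

def preprocess(text):
    """Lowercase the text and remove NLTK stop words (20-Newsgroups setting)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```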

4.2 Experimental results

In this section, we evaluate the performance of the developed model with both qualitative and quantitative evaluations. Section 4.2.1 presents the qualitative evaluation, and Sect. 4.2.2 presents the quantitative evaluation. In addition, we discuss the effects of the thresholds \( \mu_{1} \) and \( \mu_{2} \) on the modeling performance of MWCK-LDA in Sect. 4.2.3. Finally, we discuss the effects of \( \lambda_{1} \) and \( \lambda_{2} \) on MWCK-LDA in Sect. 4.2.4.

4.2.1 Qualitative evaluation

In this section, we evaluate MWCK-LDA against the baseline models qualitatively. Table 3 shows topic modeling results extracted from the NIPS dataset by the six models. Each topic is represented by its top fifteen words, and words that lack representativeness are highlighted in bold. From the meanings of the words contained in each topic, we can see that these four topics are about "Vision," "Neural Net," "Speech" and "Circuits," respectively. Table 3 shows that MWCK-LDA mines more coherent topics with fewer noisy and meaningless words than all the baseline models. In the topics extracted by LDA, some noisy words that cannot effectively characterize a topic appear at the top positions due to their high frequency, such as "natural" in topic 1 and "paper" in topic 2. This is because LDA is a fully unsupervised topic model that generates words independently based only on word co-occurrence information and lacks a mechanism for incorporating external word correlation knowledge. GK-LDA and MRF-LDA are both capable of applying must-link word correlation knowledge during the topic modeling process. Compared to standard LDA, they take similarity relationships among words into consideration and mine more meaningful topics; the number of noisy words in topics inferred by GK-LDA and MRF-LDA is lower than in LDA. The difference between GK-LDA and MRF-LDA lies in their mechanisms for incorporating and assessing word correlation knowledge. Compared to GK-LDA, MRF-LDA has a superior mechanism that generates topics of better quality. For example, GK-LDA cannot resolve the confusion in the "Neural Net" topic, which mixes a "paper" topic with a "neural network" topic, even though GK-LDA extracts higher-quality topics than LDA overall; MRF-LDA solves this problem and generates a more coherent topic. Although GK-LDA and MRF-LDA are able to incorporate must-link word correlation knowledge into the topic modeling process, some noise words remain in each topic. For example, in MRF-LDA, noise words such as "left" and "ieee" appear in the topic "Vision," and "pattern" and "spiral" appear in the topic "Neural Net." In GK-LDA, noise words such as "result" and "paper" appear in the topic "Neural Net," and "waibel" appears in the topic "Speech." These noise words are all semantically uncorrelated with the other words in their topics. Our model MWCK-LDA defines the Mixed Markov Random Field over the latent topic layer to increase the probability that correlated words are assigned to the same topic and uncorrelated words are assigned to different topics. As a result, Table 3 shows that the topics mined by our model are more coherent than those mined by the baseline models and contain fewer meaningless words.

Table 3 Topics inferred from NIPS dataset

Table 4 shows the topics mined from the 20-Newsgroups dataset. The four topics are about "Insurance," "Sports," "Health" and "Sex," respectively. From the table, the results mined by MWCK-LDA are clearly better than those mined by the baseline models, which demonstrates that our model is more effective than the baseline topic models.

Table 4 Topics inferred from 20-Newsgroups dataset

4.2.2 Quantitative evaluation

For the quantitative evaluation of topic modeling results, we choose the Coherence Measure (CM) as the metric for comparing topic quality with the baseline models. The Coherence Measure has been used as a quantitative metric of topic modeling performance in many topic model studies (Qiang et al. 2017; Xie and Xing 2013; Xie et al. 2015). In our experiment, we select the top 15 words of each topic and ask human annotators to judge whether each word is relevant to the topic. During the assessment, annotators first evaluate whether the meaning of the current topic is obvious; if not, all fifteen candidate words of the topic are marked as irrelevant. Otherwise, annotators label the words that are relevant to the topic. Finally, CM is calculated as the percentage of relevant words among all candidate words.

In our experiment, four graduate students participated in the annotation. For each dataset, ten topics were randomly selected for evaluation. The Coherence Measure results on the NIPS and 20-Newsgroups datasets are shown in Tables 5 and 6, respectively. Our model performs better than the baseline models on both benchmark datasets: it achieves average CM values of 83.5% on NIPS and 70.75% on 20-Newsgroups, which are higher than those of the baseline models. In addition, the results allow us to assess the consistency across annotators, and we observe good agreement among them. The quantitative evaluation results demonstrate the effectiveness of our model.

Table 5 CM (%) on NIPS dataset
Table 6 CM (%) on 20-Newsgroups dataset

4.2.3 Influence of \( \mu_{1} \) and \( \mu_{2} \)

In this section, we discuss the influence of the thresholds \( \mu_{1} \) and \( \mu_{2} \) in MWCK-LDA. In this experiment, we use the topic coherence metric to measure the quality of the topic modeling results. Given a topic z and its top T most probable words, topic coherence is computed as follows, and a higher value implies better topic modeling performance:

$$ {\text{C}} = \frac{1}{K}\mathop \sum \limits_{z = 1}^{K} \mathop \sum \limits_{t = 2}^{T} \mathop \sum \limits_{l = 1}^{t - 1} \log \frac{{D\left( {w_{t}^{z} ,w_{l}^{z} } \right) + 1}}{{D\left( {w_{l}^{z} } \right)}} $$
(16)

where \( \left( {w_{1}^{z} , \ldots ,w_{T}^{z} } \right) \) is the list of the T most probable words in topic z, \( D\left( w \right) \) denotes the document frequency of word w, and \( D\left( {w,w^{\prime}} \right) \) denotes the co-document frequency of \( w \) and \( w^{\prime} \). Figure 2 shows the topic coherence over different values of \( \mu_{1} \) and \( \mu_{2} \) on the NIPS dataset with K = 100 and T = 20, and Fig. 3 presents the corresponding results on the 20-Newsgroups dataset. All other parameters are fixed to the settings described in Sect. 4.1. As discussed in Sect. 3.2, the thresholds \( \mu_{1} \) and \( \mu_{2} \) determine the quality of the generated word correlation knowledge.
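
A sketch of this coherence computation is shown below (assuming `docs` is a list of token lists and `top_words` is a list of the top T words per topic; the quadratic co-document counting is a simple illustrative choice, not an optimized implementation).

```python
import math
from collections import defaultdict

def topic_coherence(top_words, docs):
    """Average topic coherence over K topics, following Eq. (16)."""
    doc_freq = defaultdict(int)      # D(w)
    co_doc_freq = defaultdict(int)   # D(w, w')
    for doc in docs:
        seen = set(doc)
        for w in seen:
            doc_freq[w] += 1
        for w1 in seen:
            for w2 in seen:
                if w1 != w2:
                    co_doc_freq[(w1, w2)] += 1

    K = len(top_words)
    total = 0.0
    for words in top_words:          # words = (w_1^z, ..., w_T^z)
        for t in range(1, len(words)):
            for l in range(t):
                w_t, w_l = words[t], words[l]
                total += math.log((co_doc_freq[(w_t, w_l)] + 1) / doc_freq[w_l])
    return total / K
```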

Fig. 2 Effect of \( \mu_{1} \) and \( \mu_{2} \) on NIPS dataset under K = 100 and T = 20

Fig. 3 Effect of \( \mu_{1} \) and \( \mu_{2} \) on 20-Newsgroups dataset under K = 100 and T = 20

Figure 2a shows that MWCK-LDA achieves the highest topic coherence when \( \mu_{1} = 0.5 \) on the NIPS dataset, indicating that setting \( \mu_{1} \) to 0.5 generates higher-quality must-link correlation knowledge. When \( \mu_{1} \) is below 0.5, the generated must-link knowledge contains much noise, which hinders our model from inferring high-quality topics. In contrast, when \( \mu_{1} \) is above 0.5, the generated must-link knowledge becomes limited because of the high threshold, which also hinders the model. Figure 2b shows that MWCK-LDA achieves the highest topic coherence when \( \mu_{2} = 0.45 \) on the NIPS dataset, indicating that setting \( \mu_{2} \) to 0.45 generates higher-quality cannot-link correlation knowledge. Unlike must-links, when \( \mu_{2} \) is set above 0.45 the generated cannot-link knowledge contains much noise and hinders MWCK-LDA from mining coherent topics, whereas when \( \mu_{2} \) is set below 0.45 the generated cannot-link knowledge becomes limited. A similar phenomenon can be observed on the 20-Newsgroups dataset in Fig. 3a, b, which show that our model performs best when \( \mu_{1} \) is set to 0.55 or \( \mu_{2} \) is set to 0.3. The optimal values of \( \mu_{1} \) and \( \mu_{2} \) therefore depend on the specific dataset. In addition, since we chose Web Eigenwords as the source for mining lexical correlation knowledge, the appropriate thresholds \( \mu_{1} \) and \( \mu_{2} \) also depend on the selected word vectors.

4.2.4 Influence of \( \lambda_{1} \) and \( \lambda_{2} \)

In this section, we investigate the effect of the parameters \( \lambda_{1} \) and \( \lambda_{2} \) in MWCK-LDA. Figure 4 shows the topic coherence over different values of \( \lambda_{1} \) and \( \lambda_{2} \) on the NIPS dataset with K = 100 and T = 20, and Fig. 5 shows the corresponding results on the 20-Newsgroups dataset. When varying \( \lambda_{1} \) or \( \lambda_{2} \), all other parameters are fixed to the experimental settings. As discussed in Sect. 3.2, \( \lambda_{1} \) balances the effect of external must-link correlation knowledge against other information patterns when inferring latent topics. Figure 4a shows that MWCK-LDA achieves the highest topic coherence when \( \lambda_{1} = 5 \) on the NIPS dataset, while on the 20-Newsgroups dataset our model performs best when \( \lambda_{1} \) equals 1. This indicates that \( \lambda_{1} \) needs to be set per dataset to balance the effect of prior must-link correlation knowledge against other information sources; for a specific dataset, an optimal value of \( \lambda_{1} \) can be found. When \( \lambda_{1} \) is below the suitable value, the effect of must-link correlation knowledge on topic extraction is weakened; when \( \lambda_{1} \) is above it, the effect of other information patterns, such as co-occurring word pairs, is weakened. Both situations hinder the model, so it is important to balance the effects of the different information sources on topic sampling in MWCK-LDA to mine coherent topics. As for \( \lambda_{2} \), it balances the effect of prior cannot-link correlation knowledge against other knowledge patterns during topic modeling. Figures 4b and 5b show that our model performs best when \( \lambda_{2} \) equals 1 on both the NIPS and 20-Newsgroups datasets, indicating that setting \( \lambda_{2} \) to 1 incorporates cannot-link correlation knowledge well across datasets.

Fig. 4 Effect of \( \lambda_{1} \) and \( \lambda_{2} \) on NIPS dataset under K = 100 and T = 20

Fig. 5 Effect of \( \lambda_{1} \) and \( \lambda_{2} \) on 20-Newsgroups dataset under K = 100 and T = 20

5 Conclusions

In this paper, we propose a Mixed Word Correlation Knowledge-based Latent Dirichlet Allocation topic model (abbreviated MWCK-LDA) to infer latent topics from a document corpus. MWCK-LDA incorporates both must-links and cannot-links generated from word embeddings in a soft manner. To incorporate this knowledge, a Mixed Markov Random Field is constructed over the latent topic layer to regularize the topic assignment of each word during the topic modeling process, which gives the words in a must-link a better chance of being assigned to the same topic and the words in a cannot-link a better chance of being assigned to different topics. In addition, the developed knowledge incorporation mechanism enables a good balance between the two forms of external knowledge and the word co-occurrence information contained in the documents during topic sampling. Experimental results show that MWCK-LDA achieves significantly better performance than the baseline topic models. In future work, we will focus on incorporating more abundant knowledge forms into topic models to further improve the coherence of the modeling results.