1 Introduction

Learning is often considered a lifelong process of acquiring knowledge and mastering new skills throughout human life. To accumulate knowledge from the past while avoiding catastrophic forgetting [36], lifelong learning has been studied in a wide range of machine learning tasks [4, 17, 50].

One-shot topic models, such as latent Dirichlet allocation (LDA) [3], DocNADE [31], the adversarial topic model (ATM) [55], and the bidirectional adversarial topic model (BAT) [54], have shown remarkable success in exploring semantic patterns from a static document collection. Yet the lack of guidance from prior knowledge limits the performance of the above methods on text streams. Early efforts have demonstrated how lifelong topic models could be learned incrementally from streaming data [19, 57], but they take a probabilistic perspective to estimate parameters and often involve complex derivations. Recently, a lifelong neural topic model named LNTM [16] was developed based on DocNADE [31] with more flexible training schemes than probabilistic models. However, it only considers the words appearing before the target word while ignoring the following words in the sequence [15]. Besides, topic models based on NADE [2], including DocNADE and iDocNADE [15], do not consider the relationship between topics since they are trained in the document-word space. As mentioned in [54], the relationship between topics is useful for improving model performance on topic coherence and downstream tasks. Therefore, a more generic lifelong neural topic model that enables continual learning by exploiting both comprehensive context and topic relationship information is valuable.

In this paper, we develop a knowledge-enhanced adversarial neural topic model (KATM) and extend it to LKATM through knowledge distillation and data augmentation. Adversarial neural topic models [22, 54, 55] use a generator network to capture the semantic structure of documents through adversarial training, which overcomes both the complex derivations required by probabilistic models and the inability of variational auto-encoder (VAE) based neural topic models to generate coherent topic words. To retain the memory of previous tasks, we further transfer prior topic information into the current task by knowledge distillation [5]. Figure 1 presents an example of learning topic words in a lifelong process. Suppose we have learned the representative words of three topics (computer, fruit, and sport) from task \(t-1\). Given a new task t, we expand the topic words based on previous results and learn a new coherent topic. To achieve this, the main challenges are: (1) how to extract knowledge from the current topic model; (2) how to exploit useful semantic patterns from past models by modeling the topic relationships; (3) how to avoid or minimize catastrophic forgetting of prior topic knowledge. In light of these considerations, we summarize the main contributions of this work as follows:

  • We develop KATM by training a knowledge extractor to retrieve the semantic patterns of documents produced by the generator network. This enables our model to extract topic knowledge from the generator and encourages it to learn more interpretable document representations.

  • We propose LKATM to incorporate semantic patterns from previously trained models into the current model, and we utilize data augmentation to avoid the conflicts caused by the inconsistent outputs of different models. To the best of our knowledge, we are the first to develop a lifelong neural topic model based on adversarial networks, and the first to utilize the principle of knowledge distillation for lifelong neural topic modeling.

Fig. 1 An example of topic words learning in a lifelong process

We evaluate the effectiveness of our KATM and LKATM on four real-world text streams. Experimental results demonstrate that the coherence and uniqueness of topics generated by our models are improved significantly when compared with state-of-the-art approaches. The quality of document representations from different models has also been tested on document classification.

2 Related work

In this section, we briefly introduce lifelong machine learning, neural topic modeling, and knowledge distillation which are related to our work.

2.1 Lifelong machine learning

Lifelong machine learning is capable of training a model from data streams. It aims to integrate current knowledge into the model without catastrophic forgetting over time [44]. Existing lifelong machine learning studies mainly focus on the following research directions: (1) Dynamic architecture based methods [8, 35, 50], which expand model architectures for new tasks to avoid losing previously learned knowledge, e.g., re-training with additional neurons or network layers. While introducing new neurons and network layers alleviates the catastrophic forgetting issue in nonstationary environments [40], such methods do not resemble biologically plausible mechanisms [44] and may be inapplicable to natural language processing models with fixed neurons. (2) Lifelong machine learning with auxiliary data [5, 17, 47, 48], which stores a few examples from previous tasks and incorporates them into the current task to tackle catastrophic forgetting. This is similar to how humans review previous tasks to acquire knowledge. By training with data sampled from each task, a method can learn the shared high-level representations of streaming data. Learning with auxiliary data has been widely studied for over two decades and is still used today because of its effectiveness. (3) Parameter consolidation methods [16, 53], which constrain the updates of neural weights. Strategies that emphasize important parameters from previous tasks have been proposed in [16, 30, 60], e.g., introducing a quadratic penalty on the difference between the parameters of prior and new tasks. However, these methods may lead to computational issues when the neural architectures become very large. On the other hand, Donahue et al. [9] attempt to prevent significant changes in the network parameters when training with new data by reducing the learning rate. Besides, a regularization term related to the prior loss [32] has been proposed to mitigate catastrophic forgetting. Unfortunately, its effectiveness is highly affected by the performance of previous models. In summary, parameter consolidation methods provide a way to learn continual tasks under certain conditions [44] but are still worthy of further research. Different from the above lifelong machine learning methods, our approach aims to minimize the difference of knowledge extracted from tasks with auxiliary information over data streams. Specifically, given a new task, the current model learns a soft target extracted from previous models to minimize catastrophic forgetting in the lifelong process.

2.2 Neural topic modeling

Topic modeling has been widely used in text mining, including document clustering, information recommendation, and information retrieval [11, 23, 26]. Traditional topic models rely on approximate approaches (e.g., variational inference and Gibbs sampling) to estimate parameters [3, 25]. However, variational inference often involves complex derivations and Gibbs sampling incurs high computational costs. To address these weaknesses, VAE and neural variational inference (NVI) [41] have been used as the frameworks of several preliminary neural topic models [37, 38, 51] due to their flexible and fast parameter inference.

With the rapid development of the generative adversarial net (GAN) [14], discovering topics with GANs has become a new direction. For instance, ATM [55] uses Dirichlet priors for latent topics instead of multivariate Gaussian or logistic-normal priors. This model trains a generator to learn the mapping from the document-topic distribution to the document-word distribution. Inspired by bidirectional adversarial training, BAT [54] builds an encoder to capture real topic distributions combined with fake distributions from the generator. To handle labeled documents, a cycle-consistent adversarial topic model [22] was proposed. Apart from the above methods, the adversarial-neural event model [56] was proposed for extracting structured representations of open-domain events. To address the lack of data representations in the topic space and the cost of manually labeling useful topics, a reward function and a topic predictor were integrated into GAN [12]. In our approach, a knowledge extractor is added to the GAN, which encourages the generator to learn more interpretable and meaningful representations by minimizing the difference between the generator input and the knowledge extractor output.

2.3 Knowledge-enhanced NLP methods

With the development of deep learning technologies, the input text alone often contains insufficient knowledge for models to produce satisfactory output. Incorporating knowledge into NLP models has thus become a promising direction in both academia and industry [59]. Recently, developing specialized architectures to process knowledge has been widely studied, including attention network based methods [7, 13, 18, 46], graph neural network based methods [61, 62], and memory network based methods [34, 58]. Knowledge-enhanced learning is agnostic to the model architecture and can be combined with various architectures. However, the sources of knowledge should not be limited to a single network structure, dictionary, or table [59], because transferring knowledge learned from multi-domain sources can discover knowledge more broadly and meanwhile improve the knowledge generation process.

Knowledge distillation is an effective solution for knowledge transfer: the predicted distributions of a teacher model are used as soft targets to train a less-parameterized student model [20]. Recent efforts have demonstrated that such refined soft predictions can improve the student model compared with hard labels [28, 33]. Furthermore, flexible methods have been extended to scenarios where student models distill knowledge without a pre-trained teacher model by learning from peers’ predictions [6].

In the field of topic modeling, a BERT-based auto-encoder teacher model [21] combines the advantages of probabilistic topic models and pre-trained transformers by mapping documents through a standard bag-of-words representation and a teacher model. Unlike the above model, our LKATM directly takes the model trained on the previous task as the teacher model to better generate the current document-topic distribution.

3 Methodology

In this section, we first describe the task of lifelong topic modeling. Then, we introduce the proposed KATM. Finally, we extend KATM to LKATM with knowledge distillation and data augmentation for lifelong topic modeling.

3.1 Problem formulation

Consider a stream of documents \(\varOmega =\{\varOmega ^1, \varOmega ^2, \varOmega ^3,...\}\) accumulated over a lifetime. During the training of the \(t^{th}\) task, there is a document collection of \(\mathbb {D}^t\) paired instances \(\{(\varvec{d}_{\varvec{r}}^t,\varvec{\theta }^t) | \varvec{d}_{\varvec{r}}^t \in \varvec{d}_{\varvec{r}},\varvec{\theta }^t \in \varvec{\theta }\}_{t=1}^{+\infty }\), where \(\varvec{d}_{\varvec{r}}^t\) denotes the set of real documents and \(\varvec{\theta }^t\) denotes the topic distribution used to generate the corresponding fake documents \(\varvec{d}_{\varvec{f}}^t\). For the \(t^{th}\) trained model \(M^t\), the goal is to generate \(\varvec{d}_{\varvec{f}}^t \leftarrow \varvec{\theta }^t\) as similar as possible to \(\varvec{d}_{\varvec{r}}^t\), without forgetting how to generate documents of previous tasks \(\varvec{d}_{\varvec{f}}^j \leftarrow \varvec{\theta }^j\), where \(j=1,2,...,t-1\).

Inspired by knowledge distillation, in which a student model is trained with the soft distribution predicted by a teacher model, we treat the current model \(M^t\) as the student model and \(M^{t-1}\) as the teacher model. Knowledge distillation is used to transfer valuable information from \(M^{t-1}\) to \(M^t\) by encouraging the two models to produce similar outputs or patterns given the same input data. In addition, the corpus \(C=\{C^1, C^2, C^3,...\}\) is augmented each time before a new task is trained, so as to accumulate knowledge.

3.2 KATM: Knowledge-enhanced adversarial neural topic model

We here present our KATM, which aims to encourage the generator to learn more interpretable and meaningful document representations. We accomplish this by minimizing the difference between the sampled document-topic distribution \(\varvec{\theta }\) and the generated document-topic distribution \(\widetilde{\varvec{\theta }}\). As shown in Figure 2, KATM contains four components: a real document set, a generator G, a discriminator D, and a knowledge extractor E. The generator contains a K-dimensional document-topic distribution layer, an S-dimensional representation layer, and a V-dimensional document-word distribution layer. The discriminator consists of a V-dimensional document-word distribution layer, an S-dimensional representation layer, and an output layer. The knowledge extractor, included in the discriminator, contains a K-dimensional document-topic distribution layer, which produces the document-topic distribution \(\widetilde{\varvec{\theta }}\) by softmax normalization.

Fig. 2 The framework of KATM on the \(t^{th}\) task, i.e., \(M^{t}\)

Following ATM [55], we train generator G to obtain a document-word distribution by transforming a K-dimensional noise variable \(\varvec{\theta } \sim Dir\left( \varvec{\theta } \mid \varvec{\alpha }\right)\) into a V-dimensional sample \(\varvec{d}_{\varvec{f}}\), where \(\varvec{\alpha }\) is the hyperparameter of the Dirichlet distribution. The generator is guided by an adversarial discriminator D which aims to distinguish the fake document-word distribution \(\varvec{d}_{\varvec{f}}\) from the true document-word distribution \(\varvec{d}_{\varvec{r}}\). The real documents in the corpus are represented by TFIDF, denoted as \(\mathbb {P}_r\), and the real distributions can be viewed as random samples drawn from \(\mathbb {P}_r\). Formally, the adversarial loss is given by \(\min _{G} \max _{D} V(D, G)=E_{\varvec{d}_{\varvec{r}} \sim \mathbb {P}_r}[\log D(\varvec{d}_{\varvec{r}})] + E_{\varvec{\theta } \sim Dir\left( \varvec{\theta } \mid \varvec{\alpha }\right) }[\log (1-D(G(\varvec{\theta })))]\).
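To make the adversarial component concrete, the following is a minimal PyTorch sketch of sampling the Dirichlet noise \(\varvec{\theta }\) and evaluating V(D, G). The dimensions, the Dirichlet parameter value, and the sigmoid output on the placeholder discriminator (needed so that \(\log D(\cdot )\) is well defined) are our assumptions; the architectures actually used are listed in Section 4.

```python
import torch
import torch.nn as nn

K, S, V = 50, 150, 2000            # topic number, representation size, vocabulary size (illustrative)
alpha = torch.full((K,), 0.1)      # Dirichlet hyperparameter alpha (the value 0.1 is an assumption)

# Placeholder networks; the exact architectures of the paper are given in Section 4.
G = nn.Sequential(nn.Linear(K, S), nn.ReLU(), nn.Linear(S, V), nn.Softmax(dim=-1))
D = nn.Sequential(nn.Linear(V, S + K), nn.ReLU(), nn.Linear(S + K, 1), nn.Sigmoid())

def adversarial_value(d_real, batch_size=64, eps=1e-8):
    """V(D, G) = E[log D(d_r)] + E[log(1 - D(G(theta)))]."""
    theta = torch.distributions.Dirichlet(alpha).sample((batch_size,))  # K-dim Dirichlet noise
    d_fake = G(theta)                                                   # fake document-word distribution
    return (torch.log(D(d_real) + eps).mean()
            + torch.log(1.0 - D(d_fake) + eps).mean())
```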

Note that the noise vector fed into the generator is pre-determined and fixed. Different from [54, 55], we propose to treat this input vector as a latent code and use it as a training target rather than simply as a noise vector. In this way, our model not only captures the semantic features of documents better by learning prior knowledge, but also infers the document-topic distribution explicitly. In particular, we develop a knowledge extractor to capture the document-topic distribution of each generated document. As shown in the bottom-right part of Figure 2, the knowledge extractor is a K-dimensional single-layer neural network included in the discriminator. As part of the discriminator’s embedding layer, it takes the fake document \(\varvec{d}_{\varvec{f}}\) as input and outputs the topic distribution \(\widetilde{\varvec{\theta }}\) by softmax normalization. We use the weight matrix of the knowledge extractor, i.e., a \(K \times V\)-dimensional matrix, as the topic-word distribution.

Suppose that generator G could generate documents indistinguishable from those sampled from the corpus and that knowledge extractor E could retrieve the semantics of fake documents; then the difference between the prior document-topic distribution \(\varvec{\theta }\) and the output \(\widetilde{\varvec{\theta }}\) should be small. The adversarial loss mentioned above encourages the generator to generate documents matching the data distribution of the corpus, while the knowledge layer loss promotes the generator to construct more explainable documents containing the given semantic information. Specifically, the Kullback-Leibler (KL) divergence between \(\varvec{\theta }\) and \(\widetilde{\varvec{\theta }}\) is used to define the aforementioned difference, as follows:

$$\begin{aligned} \mathcal {L}_{K}=\underset{i}{\sum } \varvec{\theta }_i \log \frac{\varvec{\theta }_i}{\widetilde{\varvec{\theta }}_i}. \end{aligned}$$
(1)

Finally, KATM’s loss function is defined as the following formula with a regularization term of KL divergence between \(\varvec{\theta }\) and \(\widetilde{\varvec{\theta }}\):

$$\begin{aligned} \underset{G, E}{\min }~\underset{D}{\max } V_{KATM}(D, G, E)=V(D, G) + \lambda _k \mathcal {L}_{K}, \end{aligned}$$
(2)

where \(\lambda _k\) is a hyperparameter.
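As an illustration of (1) and (2), the following sketch adds a knowledge extractor E and combines the KL term with the adversarial value, reusing G, D, and the dimensions from the sketch above; the batching and numerical details are ours, not the paper's.

```python
# Knowledge extractor E: recovers a K-dimensional topic distribution from a fake document.
E = nn.Sequential(nn.Linear(V, K), nn.Softmax(dim=-1))

def kl_term(p, q, eps=1e-8):
    """L_K in (1): sum_i p_i * log(p_i / q_i), averaged over the batch."""
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1).mean()

def katm_value(d_real, theta, lambda_k=1.0, eps=1e-8):
    """V_KATM in (2): the adversarial value plus lambda_k * L_K."""
    d_fake = G(theta)
    theta_tilde = E(d_fake)        # generated document-topic distribution
    adv = (torch.log(D(d_real) + eps).mean()
           + torch.log(1.0 - D(d_fake) + eps).mean())
    return adv + lambda_k * kl_term(theta, theta_tilde)
```

In training, the discriminator maximizes the adversarial part while G and E jointly minimize the full objective through alternating updates.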

3.3 LKATM: Lifelong knowledge-enhanced adversarial neural topic model

Knowledge distillation In its simplified form, knowledge distillation trains a student model with a soft target distribution produced by a teacher model at a user-specified temperature. Given the teacher model’s output of the last fully connected layer \(g_i\) and temperature T, the soft output \(\theta _i\) is defined by:

$$\begin{aligned} \theta _{i}=\frac{\exp \left( g_{i} / T\right) }{\sum _{j} \exp \left( g_{j} / T\right) }. \end{aligned}$$
(3)

Knowledge is then transferred by matching the student model’s predicted distribution, produced with the same temperature T, to the teacher model’s soft distribution \(\varvec{\theta }\). A higher T yields a softer distribution. Based on KATM, Figure 3 presents the framework of LKATM, which enables topical knowledge transfer across domains without catastrophic forgetting. It can also be understood as distilling the document-topic distributions generated from previous tasks. As mentioned earlier, KATM outputs \(\widetilde{\varvec{\theta }}\), i.e., a document-topic distribution from the knowledge extractor. Given the same noise document-topic distribution \(\varvec{\theta }\), the models \(M^t\) and \(M^{t-1}\) are encouraged to produce the same output. In our approach, we define the loss for knowledge distillation as follows:

$$\begin{aligned} \mathcal {L}_{DL}=KL(\widetilde{\varvec{\theta }}^{t-1}, \widetilde{\varvec{\theta }}^{t})=\sum _{i} \widetilde{\varvec{\theta }}_i^{t-1} \log \frac{\widetilde{\varvec{\theta }}_i^{t-1}}{\widetilde{\varvec{\theta }}_i^{t}}. \end{aligned}$$
(4)
Fig. 3 Overview of LKATM

The overall loss of the \(t^{th}\) task is given below:

$$\begin{aligned} \underset{G^t, E^t}{\min } ~\underset{D^t}{\max } V_{LKATM}(D^t, G^t, E^t)=V(D^t, G^t) + \lambda _k\mathcal {L}_{K}+T^2\mathcal {L}_{DL}, \end{aligned}$$
(5)

where \(\mathcal {L}_{DL}\) is multiplied by \(T^2\) to ensure that the relative contribution of the distillation term remains roughly unchanged when the temperature is changed while experimenting with meta-parameters [20].
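A sketch of the distillation terms in (3)-(5), assuming the temperature is applied to the pre-softmax outputs of the knowledge extractors of the frozen teacher \(M^{t-1}\) and the student \(M^t\); this reading, and the helper names, are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def soft_output(logits, T):
    """Eq. (3): temperature-scaled softmax over the last fully connected layer."""
    return F.softmax(logits / T, dim=-1)

def distillation_loss(teacher_logits, student_logits, T, eps=1e-8):
    """Eq. (4): KL(theta_tilde^{t-1} || theta_tilde^{t}) computed at temperature T."""
    p_teacher = soft_output(teacher_logits, T)    # frozen extractor of M^{t-1}
    p_student = soft_output(student_logits, T)    # extractor of the current model M^t
    return (p_teacher * (torch.log(p_teacher + eps)
                         - torch.log(p_student + eps))).sum(-1).mean()

def lkatm_value(v_adv, l_k, teacher_logits, student_logits, lambda_k=1.0, T=3.0):
    """Eq. (5): V(D^t, G^t) + lambda_k * L_K + T^2 * L_DL."""
    return v_adv + lambda_k * l_k + T ** 2 * distillation_loss(teacher_logits, student_logits, T)
```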

Algorithm 1

Data augmentation The performance of deep learning models largely depends on the amount of training data [10]. Data augmentation manipulates the training data to improve a model’s generalization ability. Note that (5) contains conflicting objectives: the first and second terms encourage the inputs to fit model \(M^t\), while the third term encourages \(M^t\) to produce the same output as model \(M^{t-1}\). These conflicts make it difficult for the model to learn topics efficiently.

To tackle the above problem, we use data augmentation by adding the top N real documents of \(\varOmega ^{t-1}\), ranked by their scores on discriminator \(D^{t-1}\), into the corpus \(C^t\). This augmentation removes the conflicts. In addition, the vocabulary dimension of the ground-truth data changes across domains, which causes a dimensional mismatch during training; it is therefore necessary to update \(C^t\) each time, a common operation in data augmentation.
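A sketch of the augmentation step; the paper states only that the top N real documents of \(\varOmega ^{t-1}\) are selected by their performance on \(D^{t-1}\), so the scoring function, the value of N, and the re-vectorization step below are our assumptions.

```python
import numpy as np

def augment_corpus(prev_docs_tfidf, prev_raw_docs, score_fn, current_raw_docs, N=500):
    """Add the N real documents from the previous task that D^{t-1} rates highest into C^t.

    score_fn: callable returning one score per previous-task document (e.g., D^{t-1} output).
    """
    scores = np.asarray(score_fn(prev_docs_tfidf)).ravel()
    top_idx = np.argsort(-scores)[:N]
    augmented_raw = list(current_raw_docs) + [prev_raw_docs[i] for i in top_idx]
    # The vocabulary changes across domains, so the augmented corpus must be
    # re-vectorized (e.g., by refitting a TFIDF vectorizer) before training M^t.
    return augmented_raw
```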

The overall procedure is shown in Algorithm 1. For the \(t^{th}\) task, we use \(n_d^t\) to denote the number of discriminator training iterations per generator iteration. Furthermore, \(m^t\) is the batch size, \(c^t\) is the clipping parameter, and \(\lambda _k^t\) is the weight of the knowledge extractor loss.
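The following loop is our paraphrase of Algorithm 1 for a single task. The paper lists a clipping parameter \(c\), which usually indicates Wasserstein-style weight clipping, while the value function in Section 3.2 is written in the standard GAN form; this sketch keeps the log-loss form and simply clips the discriminator weights, so it should be read as a rough approximation rather than the authors' exact procedure.

```python
import torch

def train_task(G, D, E, real_loader, ge_loss_fn, opt_ge, opt_d,
               n_d=5, c=0.01, alpha=0.1, K=50, epochs=100, eps=1e-8):
    """One task of (L)KATM training: n_d discriminator steps per joint G/E step.

    ge_loss_fn: callable (d_real, theta) -> scalar V_KATM or V_LKATM minimized by G and E.
    """
    dirichlet = torch.distributions.Dirichlet(torch.full((K,), alpha))
    for _ in range(epochs):
        for d_real in real_loader:
            m = d_real.size(0)
            for _ in range(n_d):                      # discriminator updates
                theta = dirichlet.sample((m,))
                d_fake = G(theta).detach()
                d_loss = -(torch.log(D(d_real) + eps).mean()
                           + torch.log(1.0 - D(d_fake) + eps).mean())
                opt_d.zero_grad()
                d_loss.backward()
                opt_d.step()
                for p in D.parameters():              # clipping with parameter c
                    p.data.clamp_(-c, c)
            theta = dirichlet.sample((m,))            # joint generator / extractor update
            ge_loss = ge_loss_fn(d_real, theta)
            opt_ge.zero_grad()
            ge_loss.backward()
            opt_ge.step()
```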

4 Experiments

In this section, we evaluate our KATM and LKATM by answering the following questions.

  • Q1. Does KATM effectively minimize the difference between prior and generated document-topic distributions? (Section 4.1)

  • Q2. Does the knowledge extractor learn better topic-word distributions than the generator? (Section 4.1)

  • Q3. How does KATM perform when compared with other one-shot neural topic models? (Sections 4.2 and 4.3)

  • Q4. How does LKATM perform when compared with the state-of-the-art lifelong neural topic model? (Sections 4.2 and 4.3)

  • Q5. How does temperature affect LKATM’s performance? (Section 4.4)

  • Q6. How does LKATM perform in a downstream task? (Section 4.5)

Datasets Following [16], we use four real-world datasets for training: AGnews, Tag My News (TMN), the Reuters 21578 corpus (R21578), and the 20NewsGroups corpus (20NS). Note that non-UTF-8 characters and stop words are removed. All test datasets (i.e., 20NSshort, TMNtitle, and R21578title) in [16] are used to measure model performance on short texts. Besides these short-text test datasets, we also employ a long-text test dataset, grolier, to perform lifelong topic modeling. The statistics of the datasets are shown in Table 1. Similar to [16], we construct the following data streams for evaluation:

  • \(AGnews \rightarrow TMN \rightarrow R21578 \rightarrow 20NS \rightarrow 20NSshort\)

  • \(AGnews \rightarrow TMN \rightarrow R21578 \rightarrow 20NS \rightarrow TMNtitle\)

  • \(AGnews \rightarrow TMN \rightarrow R21578 \rightarrow 20NS \rightarrow R21578title\)

  • \(AGnews \rightarrow TMN \rightarrow R21578 \rightarrow 20NS \rightarrow grolier\)

Taking the second data stream as an example, we train sequentially on AGnews, TMN, R21578, and 20NS in a lifelong process. Based on the prior models, TMNtitle is adopted as the current dataset to verify whether catastrophic forgetting is avoided. After training on TMNtitle, the current model is used to generate topics and applied to downstream tasks. Specifically, our model includes a generator, a discriminator, and a knowledge extractor. As topics are generated from the current knowledge extractor, our experiments, i.e., topic quality comparison and document classification, are mainly carried out on it. To perform document classification in the lifelong process, we first take the trained model mentioned above, then feed the document-word distributions converted from the current dataset into the knowledge extractor, and take the outputs as document-topic distributions. More details are given in Section 4.5. For clarity, the descriptions of all datasets are given below:

  1. AGnews: a data collection provided by ComeToMyHead for research purposes in text mining, information retrieval, and so forth.

  2. TMN: a news dataset labelled with 7 categories. Each news story contains a title and a description.

  3. R21578: a collection of news stories from the natural language toolkit (NLTK). NLTK is a suite of open source Python modules, data sets, and tutorials.

  4. 20NS: a collection of news stories partitioned across 20 newsgroups.

  5. grolier: the Grolier multimedia encyclopedia articles. Its content covers almost all fields, such as sports, economics, and politics.

  6. TMNtitle: titles of the TMN dataset.

  7. R21578title: titles of the R21578 corpus.

  8. 20NSshort: documents from 20NS with document size (i.e., the number of words in a document) less than 20.

Table 1 The statistics of datasets

Baselines We adopt the following models for comparison: NVDM [38], NVLDA [51], DocNADE [31], iDocNADE [15], ATM [55], BAT [54], SCH. \(+\) BAT [21], and LNTM [16]. For completeness, we present the characteristics of these baselines and our models in Table 2.

Table 2 Characteristics of baselines and our models

Network architecture For generator, discriminator, and knowledge extractor in KATM and LKATM, we use feed-forward neural networks with ReLU activation [42] and batch normalization (BN) [24]. The detailed transformations of generator are: [Linear\((K, S) \rightarrow\) ReLU \(\rightarrow\) BN \(\rightarrow\) Linear\((S, V) \rightarrow\) Softmax], those of discriminator are: [BN \(\rightarrow\) Linear\((V, S+K) \rightarrow\) ReLU \(\rightarrow\) Linear\((S+K, 1)\)], and those of knowledge extractor are: [BN \(\rightarrow\) Linear\((V, K) \rightarrow\) Softmax]. In the above, Linear() denotes a linear transformation.

In our experiments, we set the hyperparameters of KATM and LKATM as follows: \(n_d = 5, m = 64, c = 0.01, \lambda _k = 1, T=3,\) and \(S=150\). We update model parameters using Adam [29] with \(\beta _1=0.5, \beta _2=0.999\), and \(\epsilon =10^{-4}\).
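For reference, a PyTorch rendering of the listed transformations and optimizer settings is given below; the vocabulary size V is corpus-dependent, and the learning rate is not stated in this excerpt, so the value used here is an assumption.

```python
import torch
import torch.nn as nn

K, S, V = 50, 150, 2000   # topic number, representation size, vocabulary size (V depends on the corpus)

# Generator: Linear(K, S) -> ReLU -> BN -> Linear(S, V) -> Softmax
generator = nn.Sequential(nn.Linear(K, S), nn.ReLU(), nn.BatchNorm1d(S),
                          nn.Linear(S, V), nn.Softmax(dim=-1))

# Discriminator: BN -> Linear(V, S+K) -> ReLU -> Linear(S+K, 1)
discriminator = nn.Sequential(nn.BatchNorm1d(V), nn.Linear(V, S + K),
                              nn.ReLU(), nn.Linear(S + K, 1))

# Knowledge extractor: BN -> Linear(V, K) -> Softmax
knowledge_extractor = nn.Sequential(nn.BatchNorm1d(V), nn.Linear(V, K), nn.Softmax(dim=-1))

# The K x V weight of the extractor's linear layer serves as the topic-word distribution (Section 3.2).
topic_word = knowledge_extractor[1].weight        # shape (K, V)

# Adam with beta1=0.5, beta2=0.999, eps=1e-4 as reported; lr=1e-4 is assumed, not stated.
opt_ge = torch.optim.Adam(list(generator.parameters()) + list(knowledge_extractor.parameters()),
                          lr=1e-4, betas=(0.5, 0.999), eps=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.5, 0.999), eps=1e-4)
```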

4.1 Effectiveness of the knowledge extractor

As mentioned earlier, the knowledge extractor is trained to recover, from the generated documents, topic distributions that match the sampled document-topic distributions. To evaluate whether the sampled document-topic distributions are similar to the generated document-topic distributions under the proposed training, we train KATM on the 20NS dataset with 50 topics. The result in Figure 4 indicates that the divergence \(\mathcal {L}_{K}\) is maintained at about 1.50.

Fig. 4 Divergence of ATM+E and KATM over training iterations

For comparison, we also train ATM [55] with an auxiliary knowledge extractor E, where the generator is not explicitly encouraged to minimize the divergence between the prior and generated document-topic distributions. The result shows that the divergence in ATM+E quickly increases to \(\mathcal {L}_{K} \approx 3.90\). This indicates that in ATM+E, there is no guarantee that the generator makes use of the document-topic distributions to generate documents with sufficient semantic information.

Table 3 Average topic coherence scores on AGnews, TMN, R21578, and 20NS with the number of topics set to 20, 30, 50, 75, and 100. Given a dataset, the best value on each metric is highlighted by boldface
Table 4 Average topic coherence scores on AGnews, TMN, R21578, 20NS, 20NSshort, R21578title, TMNtitle, and grolier with the number of topics set to 50 and 100. Given a dataset, the best value on each metric is highlighted by boldface

In addition, to explore whether the knowledge extractor generates topic-word distributions effectively, we build a variant, named KATM_G, that uses another adversarial network to generate the word distribution of each topic from the generator instead of from the knowledge extractor. We use four widely adopted topic coherence metrics, i.e., C_V [49], C_A [1], NPMI [1], and UMass [39], to evaluate the performance of different models. A higher topic coherence value means more understandable topics are extracted. All coherence values are calculated by the Palmetto library over the top 10 words of 50 topics according to the generated topic-word distribution. The results are shown in Table 3, which indicates that the topic words generated by the knowledge extractor in our KATM achieve better performance than KATM_G on AGnews, TMN, R21578, and 20NS. This validates that the knowledge extractor in adversarial neural topic models is useful for capturing coherent topics.
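For concreteness, a small sketch of how the top-10 words of each topic can be read off a \(K \times V\) topic-word matrix (e.g., the knowledge extractor's weight) before being scored by an external tool such as Palmetto; the variable names are ours.

```python
import numpy as np

def top_words_per_topic(topic_word, id2word, n_top=10):
    """topic_word: (K, V) array. Returns the n_top highest-weighted words per topic."""
    topics = []
    for row in topic_word:
        top_ids = np.argsort(-row)[:n_top]
        topics.append([id2word[i] for i in top_ids])
    return topics

# These word lists are what coherence tools (e.g., Palmetto) score with C_V, C_A, NPMI, and UMass.
```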

Table 5 Top 10 words of 4 representative topics extracted by LNTM and LKATM, where irrelevant words are marked in italics. These 4 topics indicate ‘computer’, ‘political’, ‘sports’, and ‘agriculture’, respectively

4.2 Topic coherence comparison

In this task, we evaluate the performance of the proposed models and the baselines using the topic coherence metrics mentioned above. The numbers of topics are set to 20, 30, 50, 75, and 100, except for iDocNADE, which only uses 50 and 100 since it represents a document by summing the vectors of its words through Glove embeddings [45]. The averaged coherence scores are reported as the final results in Table 3. These results indicate that KATM performs better than the others (except for iDocNADE, which achieves a better UMass score on 20NS). Furthermore, LKATM maintains competitive topic coherence scores in the lifelong process and even outperforms KATM on TMN and 20NS. This validates that LKATM can effectively avoid catastrophic forgetting and learn from past models to obtain high-quality topics.

In addition, we compare the average topic coherence scores of our LKATM with the existing lifelong neural topic model LNTM. The detailed topic coherence scores are shown in Table 4. Since LNTM represents a document by summing the word vectors through Glove embeddings [45], the topic numbers are set to 50 and 100, which are included in the pre-trained Glove model, to calculate topic coherence. Each value is calculated by averaging coherence scores over the top 10 words. We highlight the best topic coherence value on each metric by boldface. Across all metrics, LKATM achieves the best performance on the training datasets (AGnews, TMN, R21578, and 20NS) and also performs better on the testing datasets (20NSshort, R21578title, TMNtitle, and grolier).

As an illustration, Table 5 presents top 10 words of 4 representative topics extracted by LNTM and LKATM. The result shows that the proposed LKATM can generate more coherent topics.

4.3 Topic uniqueness comparison

As mentioned in [43], neural topic models tend to achieve high coherence scores while generating nearly identical topics to minimize the loss. It is therefore also important to generate topics that are diverse rather than repetitive. Thus, we compute the topic uniqueness (TU) scores proposed in [43] to estimate the discrimination of topics. Given a set of top-n representative words from each of the K topics, the TU score of topic k is inversely proportional to the number of times each of its top-n words is repeated across the set, and the average TU is computed as TU = \(\frac{1}{K}\sum _{k=1}^K\)TU(k). The TU value ranges between \(\frac{1}{K}\) and 1, and a higher TU means the produced K topics are more diverse.
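A sketch of the TU computation as we read the definition in [43]: each top-n word of topic k contributes the reciprocal of the number of topics whose top-n lists contain that word, and TU(k) averages these contributions.

```python
from collections import Counter

def topic_uniqueness(topics):
    """topics: list of K lists, each holding the top-n words of one topic.

    TU(k) = (1/n) * sum_{w in top-n(k)} 1 / (#topics whose top-n list contains w);
    the returned value is the average over the K topics (our reading of [43]).
    """
    counts = Counter(word for topic in topics for word in set(topic))
    tu_per_topic = [sum(1.0 / counts[w] for w in topic) / len(topic) for topic in topics]
    return sum(tu_per_topic) / len(tu_per_topic)
```

If all K topics share the same top-n words, every count equals K and TU collapses to \(\frac{1}{K}\); if no word is shared, TU equals 1, matching the stated range.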

We compare the TU scores of our KATM with the one-shot topic models mentioned above. The results are shown in Table 6, from which we observe that the GAN based methods (i.e., ATM, BAT, and KATM) generate more diverse topics than the other models. Table 7 presents the TU scores of the proposed lifelong learning method LKATM and of LNTM. We observe that LKATM achieves higher TU scores than LNTM across all datasets, which indicates that LKATM captures more diverse topics.

Table 6 TU scores of one-shot models with 50 and 100 topics, where the top 10 words are used for calculation. The best value on each dataset is highlighted by boldface
Table 7 TU scores of lifelong models with 50 and 100 topics, where the top 10 words are used for calculation. The best value on each dataset is highlighted by boldface
Fig. 5 Topic coherence scores on C_V, C_A, NPMI, and UMass at different temperatures

Table 8 Classification results of DocNADE, iDocNADE, LNTM, and LKATM when combined with the LightGBM classifier. Given a dataset, the best value on each metric is highlighted by boldface

4.4 Impact of the temperature

To explore how the topic coherence scores vary with the temperature in our LKATM, we show the topic coherence on the four test datasets in Figure 5. In terms of C_V, C_A, and UMass, LKATM achieves the best performance when the temperature is set to 4, while for NPMI the best-performing temperature is 3. The results indicate that either a too low or a too high temperature reduces topic quality: a low temperature may not distill sufficient knowledge, while a high temperature distills so much prior knowledge that it hinders learning the current task. Besides, different datasets present approximately the same trend.

4.5 Application to document classification

Our method is able to generate more coherent topics and potentially more interpretable document representations, which can benefit downstream tasks such as document classification. We employ the TMN and 20NS datasets and compare the proposed LKATM with DocNADE, iDocNADE, and LNTM in this experiment. For each dataset, we randomly select 80% of the data as the training set and 20% as the testing set. LightGBM [27], a highly efficient gradient boosting decision tree, is adopted as the classifier. In particular, it takes the document-word distribution represented by TFIDF as the input [52]. In our method, we use TFIDF as the knowledge extractor’s input and obtain the topic distributions by softmax normalization; that is, we represent each text using the product of the TFIDF vector and the transposed topic-word distribution. To ensure fair comparisons, the same process is performed for all baselines. Table 8 presents the MacroF1, MicroF1, and AUC scores of the LightGBM classifier when using the original TFIDF and the text representations based on DocNADE, iDocNADE, LNTM, and our LKATM. The results indicate that the document representations generated by our method are much more beneficial to document classification than those of the baselines.
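A sketch of the classification pipeline as we understand it: each document's TFIDF vector is multiplied by the transposed \(K \times V\) topic-word matrix to obtain a K-dimensional feature, which is fed to LightGBM. The vocabulary alignment, the split, and the classifier settings here are our assumptions rather than the paper's configuration.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def classify_with_topic_features(raw_docs, labels, topic_word, vocabulary):
    """topic_word: (K, V) matrix from the trained knowledge extractor; vocabulary maps
    its V columns to words so that the TFIDF features are aligned with it."""
    tfidf = TfidfVectorizer(vocabulary=vocabulary).fit_transform(raw_docs)   # (N, V)
    features = tfidf @ topic_word.T                                          # (N, K) doc-topic features
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)
    clf = LGBMClassifier().fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return {"MacroF1": f1_score(y_te, pred, average="macro"),
            "MicroF1": f1_score(y_te, pred, average="micro"),
            "AUC": roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovr")}
```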

5 Conclusions

In this work, we proposed a knowledge-enhanced adversarial neural topic model named KATM and a lifelong neural topic model based on it (i.e., LKATM) for capturing coherent topics. KATM discovers topics by training a knowledge extractor, which encourages the generator to produce more meaningful documents by treating each input vector as a learning target. LKATM utilizes knowledge distillation and data augmentation to transfer prior topic cues into the current task while avoiding catastrophic forgetting. We empirically demonstrated that the proposed methods achieve better topic coherence and uniqueness than state-of-the-art topic models on various benchmark datasets.