
1 Introduction

Many applications involving short texts need the ability to learn topics on the fly as new data points arrive in an evolving system. For example, consider an organization that tries to understand customer feedback (which is typically short text) using topic modeling. With the constant churn of feedback from customers, it is not prudent to rerun the topic modeling algorithm on the complete data at every update (whenever a new piece of feedback is collected). Also, as an organization introduces new services and functionalities and addresses existing issues in its products/services, the nature of user-supplied feedback texts changes over time. To capture the thematic content of evolving feedback, an online topic modeling algorithm should be in place that gives more importance to current feedback than to older feedback texts. Motivated by this, we propose an online topic discovery algorithm (OTDA) for streaming short texts which incorporates the word-context semantic correlation learnt from the skip-gram view of the corpus, much like the semantics-assisted NMF (SeaNMF) model [19], together with the adaptivity of a forgetting mechanism [5]. The SeaNMF model is solved using a block co-ordinate descent (BCD) algorithm. The well-known methods for solving BCD need to hold the entire data matrix in memory throughout the computation, which can be prohibitive for large data sets. Although various online NMF algorithms, such as the algorithm of [6], have been proposed that can detect latent factors and track their evolution as new data arrive, none of them is suitable for short texts. To address this issue we incorporate a variant of the online NMF algorithm of [21] in OTDA, grounded in the framework of SeaNMF [19], to discover topics from very large scale/streaming short texts.

The OTDA algorithm works with one data point or one chunk of data points at a time instead of storing the whole data set in memory. Further, it updates the topic representation in an underlying space as well as the context representation in terms of words on the arrival of new data, employing the Projected Gradient Descent (PGD) algorithm in both steps. Incorporating context information improves the quality of incremental topic modeling, as it captures the semantics of the short text corpus through word-document and word-context correlations, thus overcoming the lack of word co-occurrence in short texts. To highlight the adaptivity of our learning algorithm we introduce a decay factor that exponentially reduces the contribution of historical data, thereby imposing a forgetting (memorylessness) mechanism on the topic discovery process [5]. We design experiments to investigate the effect of the forgetting mechanism, and the results show that one needs to forget to adapt: in the absence of decay parameters the quality of the generated topics suffers (NMI values go down) as new data points arrive, whereas richer topics are generated for streaming data when decay factors are present.

Contribution of This Work. This work contributes to the body of online topic discovery in several ways. The topic learning incorporates word-context semantic correlation from the SeaNMF model; however, it uses the framework of the distributed clustering algorithm [22] without an increase in computational overhead or memory requirements. The algorithm achieves a noticeable speed-up, as the topic computation is done locally with a reduced memory footprint. Also, like other online applications, our method allows memorylessness by introducing decay factors into the computation, which cause the past history to be forgotten at an exponential rate and attach more importance to the current set of data. Extensive experimentation on real-life data sets produces interesting results on metrics such as average Frobenius loss, Topic Coherence, NMI and emergent topic detection.

Organization: The paper is organized as follows. In the next section (Subsect. 1.1) we discuss the current literature related to this work. In Sect. 2 we review background material on NMF concepts and related matters. We propose our online topic discovery algorithm in Sect. 3. We discuss the data sets and the metrics used for experimentation in Sects. 4 and 5 respectively. The results of the experiments are furnished in Sect. 6. Finally, we conclude in Sect. 7.

1.1 Related Work

Two groups of topic models are frequently employed to automatically extract topical content from documents: generative probabilistic models such as PLSA [10] and LDA [3], and non-negative matrix factorization (NMF) [24]. They normally work well for lengthy documents. However, these techniques do not produce meaningful results for short texts, as the term-document matrix is very sparse, providing scarce word co-occurrence information and hence generating poor quality topics [7, 19]. Many methods have been proposed in recent times to tackle this problem. These include aggregating short texts into pseudo-documents, and extracting cross-document co-occurrence [16, 27] using internal semantic relationships between words. While a pseudo-document generated in the first approach may contain many irrelevant short texts, noise and bias can creep in due to the adoption of Wikipedia-centric notions of semantics in the second approach. To alleviate these problems, Shi et al. have proposed a novel semantics-assisted NMF (SeaNMF) model for short texts which incorporates word-context semantic correlations learned from the skip-gram view of the corpus [19]. The rest of the discussion on relevant prior art is divided into two parts, online topic discovery and online NMF.

Online Topic Discovery. In one of the earlier works on online topic modeling based on LDA, Blei et al. [2] develop a family of probabilistic time series models to analyze the time evolution of topics in documents. Another LDA-based model is proposed in [23] to model a topic as a continuous distribution over timestamps and the mixture distribution as a function of both word co-occurrences and the document's timestamp. AlSumait et al. [1] introduce a topic modeling framework based on the LDA model that works in an online fashion, incrementally building an up-to-date model (mixture of topics per document and mixture of words per topic) as sets of documents appear. The authors of [11] propose another online topic model for sequentially analyzing the time evolution of topics over multiple scales in a large collection of documents. Some online topic models have also been proposed for short texts like tweet data, such as [18], wherein the authors model the generation process of tweets by estimating the ratio between topic words and general words for each user.

Online NMF. We have not come across any work that uses NMF for online topic modeling; hence we discuss a few works related to online NMF. Cao et al. have proposed an online NMF which finds two factor matrices to approximate the whole data matrix [6]. Although it performs well in practice, it cannot be applied to large-scale or streaming data sets due to memory limitations. Bucak and Gunsel have proposed an incremental NMF [5] in which the term-topic matrix is updated on the arrival of the \((k+1)\)th sample. This works well in practice but is time consuming, as the update rules have slow convergence. Zhou et al. have proposed another variant of incremental NMF with a volume constraint [26]. In [8] Guan and Tao propose an efficient online NMF algorithm that learns NMF in an incremental fashion using robust stochastic approximation. In [21] an online NMF algorithm has been proposed for efficient document clustering on very large and streaming data sets. The algorithm proposed in this paper improves on that algorithm in the sense that we consider word-context correlation in the model and incorporate decay factors that cause the past history to be forgotten at an exponential rate.

2 Basic NMF Model for Topic Discovery

In this section we discuss the basic NMF method, its application to topic modeling, and the recently proposed SeaNMF method [19] for short texts.

Notation:  Let \({\mathbb {R}}\) denote the set of real numbers (or reals), \({\mathbb {R}}_{+}\) the set of non-negative real numbers and \({\mathbb {N}}\) the set of natural numbers. \(\pmb {x} \in {\mathbb {R}}^n\) denotes an n-dimensional vector of reals. \(\pmb {1}_K\) denotes a row vector of size K all of whose elements are 1. Also, \(\left\Vert \pmb {x}\right\Vert _1\) and \(\left\Vert \pmb {x}\right\Vert _2\) denote the \(\ell _{1}\) and \(\ell _{2}\) norms of vector \(\pmb {x}\) respectively. We use the notation \(\mathbf {X} \in {\mathbb {R}}^{p \times q}\) to denote a matrix of real numbers having p rows and q columns (or having dimension \(p \times q\)). We denote the elements of a matrix \(\mathbf {X} \in {\mathbb {R}}_+^{p \times q}\) as \([x_{ij}]_{\{ 1 \le i \le p, 1 \le j \le q \}}\). We use \(\pmb {X}_{i \cdot }\) and \(\pmb {X}_{\cdot j}\) to denote the ith row vector and the jth column vector of matrix \(\mathbf {X}\). In some cases the column vector \(\pmb {X}_{\cdot j}\) will also be denoted as \(\pmb {x}_j\). Further, \(\left\Vert \mathbf {X}\right\Vert _F^2\) denotes the sum of the squared elements of the matrix \(\mathbf {X}\) (the square of its Frobenius norm). The zero matrix \(\pmb {0}\) has all zero entries, with its dimension to be read off from the context.

Basic NMF Model. The problem of Non-Negative Matrix Factorization (NMF) deals with factoring a given matrix into two non-negative matrices [13, 24]. Given an input matrix \(\mathbf {X} \in {\mathbb {R}}_+^{m \times n}\) and an integer \(K \ll \min (m, n)\), NMF seeks a lower-rank approximation \(\mathbf {X} \approx \mathbf {U} \mathbf {V}^{T}\), where \(\mathbf {U} \in {\mathbb {R}}_+^{m \times K}\) and \(\mathbf {V}\in {\mathbb {R}}_+^{n \times K}\) are factor matrices. This is done by considering the optimization problem that minimizes the following objective/loss function (also called the error of approximation or the Frobenius loss):

$$\begin{aligned} \min {{{\mathcal {L}}}(\mathbf {U},\mathbf {V}) \left( = \frac{1}{2} \left\Vert \mathbf {X} - \mathbf {U} \mathbf {V}^{T} \right\Vert _F^2 \right) }, \,\, \text{ s.t. }, \mathbf {U} \ge 0, \mathbf {V} \ge 0 \end{aligned}$$
(1)

Popular algorithms for solving the NMF problem with the Frobenius loss of Eq. 1 are the Multiplicative Update Algorithm (MUA) [14], Blockwise Co-ordinate Descent (BCD) [12] and the Projected Gradient Method (PGD) [15], to name a few. We mainly adopt PGD [15], which follows the alternating minimization principle.
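To make the alternating structure concrete, the following is a minimal sketch of NMF with multiplicative updates for the loss in Eq. 1; the function name, iteration budget and random initialization are illustrative assumptions, and this is not the exact solver used later in the paper.

```python
import numpy as np

def nmf_mua(X, K, n_iter=200, eps=1e-10, seed=0):
    """Approximate X (m x n) as U V^T with U (m x K), V (n x K), both non-negative."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, K))
    V = rng.random((n, K))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)      # fix V, update U
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)    # fix U, update V
    loss = 0.5 * np.linalg.norm(X - U @ V.T, "fro") ** 2   # Frobenius loss of Eq. 1
    return U, V, loss
```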

Topic Discovery Using NMF.  In topic modeling, \(\mathbf {X} \in {\mathbb {R}}_+^{m \times n}\) is the term-document matrix of a given corpus with n documents and m terms. \(\pmb {X}_{\cdot l} \in {\mathbb {R}}_+^{m}\) is the l-th column vector of \(\mathbf {X}\), which corresponds to the bag-of-words representation of document l with respect to the m terms, possibly using TF*IDF weights after some pre-processing and column-wise \(\ell _{2}\) normalization. For solving the minimization problem in Eq. 1 one assumes a pre-determined number of topics K.

Topic Modeling for Short Texts Using NMF. As short texts are sparse and consist of only a few terms, many unrelated documents may lead to biased relationships between terms, resulting in poor clustering (and topic extraction). Moreover, most algorithms for solving NMF fail to appropriately discover the relationship between terms and their contexts. To overcome this problem the authors in [19] propose a novel semantics-assisted NMF (SeaNMF) model to learn topics from short texts.

The SeaNMF approach is based on the idea that terms depend on the contexts in which they appear. Towards this, the authors define the term-context correlation matrix \(\mathbf {R}\) [19] using the skip-gram view of the corpus in the presence of an M-dimensional context vector \(\pmb {c}\):

$$\begin{aligned} {r}_{ij} = \max {\left[ \log \left( \frac{\#(t_i,c_j)}{\#(t_i) \cdot p(c_j)} \right) - \log \kappa , 0 \right] }, 1 \le i \le m, 1 \le j \le M \end{aligned}$$
(2)

We use \({\mathbb {V}}\) to denote the overall vocabulary of terms and contexts. The notation \(\#(t_i,c_j)\) denotes the number of times \(t_i\) appears with context \(c_j\) in the text corpus. Further, \(\#(t_i) = \sum _{c_j \in {\mathbb {V}}} \#(t_i,c_j)\) and \(\#(c_j) = \sum _{t_i \in {\mathbb {V}}} \#(t_i, c_j)\) represent the number of times \(t_i\) and \(c_j\) occur in all possible term-context pairs respectively, and \(\kappa \) is the number of negative samples. Finally, \(p(c_j)\) is a unigram distribution for sampling a context \(c_j\), defined as \(p(c_j) = \frac{\#(c_j)}{\sum _{c_j \in {\mathbb {V}}} \#(c_j)}\). There are a few ways to specify the sliding window for a context [19]. For example, each document can be selected as the context window [19] for a term in a short text corpus, or the context can be a long pseudo-text obtained by aggregating short texts belonging to a cluster. A fixed-size window of neighboring words can also act as the context for a word, and so on.
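For concreteness, the following sketch builds a term-context correlation matrix in the spirit of Eq. 2, treating each short text as the context window of its terms so that contexts coincide with vocabulary words (one of the options listed above); the helper name and data layout are assumptions rather than the authors' implementation.

```python
import numpy as np
from itertools import permutations

def term_context_matrix(docs, vocab, kappa=1.0):
    """Shifted positive PMI matrix R (Eq. 2), with each document as the context window."""
    idx = {w: i for i, w in enumerate(vocab)}
    m = len(vocab)                                   # terms and contexts share the vocabulary here
    counts = np.zeros((m, m))
    for doc in docs:
        ids = [idx[w] for w in doc if w in idx]
        for i, j in permutations(ids, 2):            # every other word in the document is a context
            counts[i, j] += 1.0                      # #(t_i, c_j)
    term_tot = counts.sum(axis=1, keepdims=True)     # #(t_i)
    ctx_tot = counts.sum(axis=0, keepdims=True)      # #(c_j)
    p_ctx = ctx_tot / ctx_tot.sum()                  # unigram distribution p(c_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / (term_tot * p_ctx))
    R = np.maximum(pmi - np.log(kappa), 0.0)         # shift by log(kappa), clip at zero
    R[~np.isfinite(R)] = 0.0                         # zero counts contribute no correlation
    return R
```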

Finally, SeaNMF proceeds in two steps. In the first step the term-context correlation matrix \(\mathbf {R}\) is factored into two matrices, the term-topic matrix \(\mathbf {U} \in {{\mathbb {R}}_{+}}^{m \times K}\) and a newly introduced context-topic matrix \(\mathbf {U}_c \in {{\mathbb {R}}_{+}}^{M \times K}\). In the second step the term-document matrix \(\mathbf {X} \in {\mathbb {R}}_+^{m \times n}\) is factored along with the term-topic matrix to obtain the document-topic matrix \(\mathbf {V}\in {\mathbb {R}}_+^{n \times K}\) (a sparsity constraint may be imposed on \(\mathbf {V}\) in the process). For details the reader is advised to consult [19].

The computational complexity of SeaNMF for short texts is the same as that of standard NMF [12] using the MUA or BCD method, namely O(nmK) for a single iteration and O(TnmK) for T iterations (assuming \(K \ll \min (n,m)\)). However, as \(\mathbf {R}\) and \(\mathbf {X}\) are sparse matrices, the authors conclude that the complexity of the SeaNMF model using BCD is \(O(zK)\) (\(O(TzK)\)) for a single iteration (T iterations), where \(z = \max {(z_{\mathbf {R}},z_{\mathbf {X}})}\), and \(z_{\mathbf {R}}\) and \(z_{\mathbf {X}}\) are the numbers of non-zero elements in the matrices \({\mathbf {R}}\) and \({\mathbf {X}}\) respectively, with \(\max {(z_{\mathbf {R}},z_{\mathbf {X}})} \ll mn\) and \(K \ll \min (m,n)\); this is less expensive than standard NMF. Further, the SeaNMF model is required to hold the matrices \(\mathbf {X}\) and \(\mathbf {R}\) in memory at a storage cost of O(mn).

3 Proposed Online Topic Modeling for Short Texts

We propose an online Topic Discovery (OTDA) algorithm that updates the matrices \(\mathbf {U}, \mathbf {U}_c\) and \(\mathbf {V}\) by adding the effects of subsequent samples in an incremental fashion.

3.1 An Incremental Form of NMF

Note the loss function in Eq. 1 can be decomposed as [12]:

$$\begin{aligned} {\mathcal {L}}(\mathbf {U}, \mathbf {V}) = \left\Vert \mathbf {X} - \mathbf {U} {\mathbf {V}}^T\right\Vert _F^2 = \sum _{j=1}^{n} \left\Vert \mathbf {X}_{\cdot j} - \mathbf {U} ({\mathbf {V}}^T)_{\cdot j}\right\Vert _F^2 = \sum _{j=1}^{n} \left\Vert \pmb {x}_j - \mathbf {U} \pmb {v}_j\right\Vert _F^2 \end{aligned}$$
(3)

Consider the problem of generating K topics from the data set. The term-topic matrix has the form \(\mathbf {U} = [ \pmb {u}_1 \cdots \pmb {u}_K]\), which represents each topic as a weighted combination of terms. Further, \(\pmb {v}_j = [g_{j1} \cdots g_{jK}]^{T}\) holds the reconstruction weights of \(\pmb {x}_j\) with respect to these representatives.

When \(\mathbf {U}\) is fixed, the minimum value of \({\mathcal {L}}(\mathbf {U}, \mathbf {V})\) is reached if and only if the cost function \({\mathcal {L}}(\mathbf {U}, \pmb {v}_j) = \left\Vert \pmb {x}_j - \mathbf {U} \pmb {v}_j\right\Vert _F^2\) is minimized for all \(j, 1 \le j \le n\). Thus, one solves independent Non-negative Least Squares (NNLS) problems of the form,

$$\begin{aligned} \min _{\pmb {v}_j \ge 0} \left\Vert \pmb {x}_j - \mathbf {U} \pmb {v}_j\right\Vert _F^2, j = 1,2 \ldots n \end{aligned}$$
(4)

and aggregate the solution as \(\mathbf {V} = [\pmb {v}_1 \cdots \pmb {v}_n ]\).

3.2 Computing Document Representations

In this step we keep the topic representation \(\mathbf {U}\) fixed and solve the optimization problem in Eq. 5 to compute \(\pmb {v}^{(t)}\):

$$\begin{aligned} \min \frac{1}{2} \left( \left\Vert \pmb {x}^{(t)} - \mathbf {U} \pmb {v}^{(t)}\right\Vert _F^2 + \lambda \left\Vert \pmb {v}^{(t)}\right\Vert _1^2 \right) \, \mathrm {s.t.}, \pmb {v}^{(t)} \ge 0, \mathbf {U} \,\, \text{ is } \text{ given } \end{aligned}$$
(5)

where \(\lambda > 0\) is a constant; the \(\ell _{1}\) term imposes sparsity on \(\pmb {v}^{(t)}\). The NNLS problem given by Eq. 5 is the so-called Lasso problem [20], which can be solved using Projected Gradient Descent (PGD) [15] with the gradient computed as: \(\frac{\partial {{\mathcal {L}}^{(t)}}}{\partial {\pmb {v}^{(t)}}} = - {(\pmb {x}^{(t)})}^T \mathbf {U} + {(\mathbf {U} \pmb {v}^{(t)})}^T \mathbf {U} + \lambda \pmb {1}_K^T\).
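A minimal sketch of this step, assuming a fixed \(\mathbf {U}\) and a single incoming document \(\pmb {x}^{(t)}\), is given below; the step size (derived from the Lipschitz constant of the smooth part) and the iteration count are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def solve_document(U, x, lam=0.1, n_iter=100):
    """Projected gradient for the NNLS of Eq. 5 with U fixed, using the gradient expression above."""
    m, K = U.shape
    UtU = U.T @ U
    Utx = U.T @ x
    step = 1.0 / (np.linalg.norm(UtU, 2) + 1e-12)    # safe step from the Lipschitz constant
    v = np.zeros(K)
    for _ in range(n_iter):
        grad = UtU @ v - Utx + lam * np.ones(K)      # gradient of Sect. 3.2, written in column form
        v = np.maximum(v - step * grad, 0.0)         # projection onto the non-negative orthant
    return v
```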

3.3 Solving for Context Representation

In this step we compute the context representation of terms in an incremental fashion. We assume that the M-dimensional context vector \(\pmb {c}^{(t)}\) is available at time instant t; it can be invariant with time or can be learned incrementally as new samples arrive, e.g., it can be learned online as a cluster of data points for streaming data [25]. Thus at time point t we can compute the term-context correlation matrix \({\mathbf {R}}^{(t)}\) with the aid of the current context information \(\pmb {c}^{(t)}\) using Eq. 2.

Now we solve for the underlying representation of the context in the form of the context-topic matrix \(\mathbf {U}_c^{(t)}\) by minimizing the cost function in Eq. 6 below, keeping \(\mathbf {U}\) constant. We also require the computed \({\mathbf {U}_c^{(t)}}\) to be dense by adding an \(\ell _2\)-regularization term for it, where \(\beta > 0\) is a constant. Again, this NNLS problem can be solved using a standard optimization algorithm.

$$\begin{aligned} \frac{1}{2} \left\Vert {\mathbf {R}}^{(t)} - \mathbf {U} ({\mathbf {U}_c^{(t)}})^T\right\Vert _F^2 + \beta \left\Vert \mathbf {U}_c^{(t)}\right\Vert _F^2\,\, \text{ s.t. },\,\, \mathbf {U}_c^{(t)} \ge 0, \,\, \mathbf {U} \,\, \text{ is } \text{ given } \end{aligned}$$
(6)
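As an illustration, Eq. 6 decomposes column-wise: each column of \({\mathbf {R}}^{(t)}\) yields an independent ridge-regularized NNLS problem for one row of \(\mathbf {U}_c^{(t)}\), which can be folded into an augmented least-squares system. The sketch below uses scipy.optimize.nnls as a stand-in for whatever standard solver is employed.

```python
import numpy as np
from scipy.optimize import nnls

def solve_context(U, R, beta=0.1):
    """Solve Eq. 6 for U_c with U fixed: each column of R gives one row of U_c."""
    m, K = U.shape
    M = R.shape[1]
    Uc = np.zeros((M, K))
    # the augmented rows fold the beta * ||U_c||_F^2 penalty into an ordinary NNLS
    A = np.vstack([U, np.sqrt(2.0 * beta) * np.eye(K)])
    for j in range(M):
        b = np.concatenate([R[:, j], np.zeros(K)])
        Uc[j], _ = nnls(A, b)
    return Uc
```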

3.4 Updating Topic Representations

The topics, represented in the form of the term-topic matrix \(\mathbf {U}\), are updated in this step. At time instant t, as \(\pmb {x}^{(t)}\) arrives, OTDA first solves for \(\pmb {v}^{(t)}\) and \({\mathbf {U}_c^{(t)}}\) using \(\mathbf {U}^{(t-1)}\), and then updates \(\mathbf {U}\) by minimizing the following loss function:

$$\begin{aligned} {\mathcal {L}}^{(t)}(\mathbf {U}^{(t)}) = \left[ \frac{{\gamma }_0}{2} \sum _{s=1}^{t} \mu \left\Vert \mathbf {R}^{(s)} - \mathbf {U}^{(t)} {\mathbf {U}_c^{(s)}}^T\right\Vert _F^2 + \sum _{s=1}^{t} \frac{{\gamma }_s}{2} \left\Vert \pmb {x}^{(s)} - {\mathbf {U}}^{(t)} \pmb {v}^{(s)}\right\Vert _F^2 \right] \end{aligned}$$
(7)

under the constraint \({\mathbf {U}}^{(t)} \ge 0\). Here \(\pmb {v}^{(s)}\) is obtained as a solution of the minimization problem in Eq. 5, and \(\mathbf {U}^{(s)}_{c}\) is found by solving Eq. 6.

We introduce decay factors [5] to ensure that the effect of new samples on the representation is higher, while that of old ones wanes (memorylessness). That is, \({\gamma }_0, {\gamma }_s\, (s = 1,2 \ldots , t)\) are decay factors which cause the past history to be forgotten at an exponential rate. We define,

$$\begin{aligned} \gamma_j = \begin{cases} {\gamma_0}^{(t - 2r)}, & j \le 2r\\ {\gamma_0}^{(t - j)}\,\gamma_f, & 2r < j \le t \end{cases} \end{aligned}$$

We assume \({\gamma }_0< 1 ({\gamma }_0 \approx 0.5), {\gamma }_f < 1 ({\gamma }_f \approx 0.9)\) and \(r =1\).
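As a small numeric illustration with these values and t = 6, the decay weights assigned to batches 1 through 6 are computed below; the oldest batches share the smallest weight while the most recent batch keeps a weight close to \({\gamma }_f\).

```python
# Decay weights gamma_j for batches j = 1..t with gamma_0 = 0.5, gamma_f = 0.9, r = 1, t = 6.
gamma0, gammaf, r, t = 0.5, 0.9, 1, 6

def gamma(j):
    return gamma0 ** (t - 2 * r) if j <= 2 * r else gamma0 ** (t - j) * gammaf

print([round(gamma(j), 4) for j in range(1, t + 1)])
# -> [0.0625, 0.0625, 0.1125, 0.225, 0.45, 0.9]
```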

The gradient of \({\mathcal {L}}^{(t)}\) wrt \(\mathbf {U}^{(t)}\) is given by

$$\begin{aligned} \nabla _{\mathbf {U}^{(t)}} \left( {\mathcal {L}}^{(t)} (\mathbf {U}^{(t)})\right) = - {\gamma }_0 \sum _{s=1}^{t} \mu [ \pmb {R}^{(s)} {\mathbf {U}_c^{(s)}} - \mathbf {U}^{(t)} {\mathbf {U}_c^{(s)}}^T \mathbf {U}_c^{(s)}]\qquad \qquad \qquad \nonumber \\ - \sum _{s=1}^{t} [{\gamma }_s ( \pmb {x}^{(s)} {\pmb {v}^{(s)}}^T - \mathbf {U}^{(t)} \pmb {v}^{(s)} {\pmb {v}^{(s)}}^T)] \end{aligned}$$
(8)

One can update \(\mathbf {U}^{(t)}\) using PGD starting from an initial value \(\mathbf {U}_0^{(t)}\). However, when we implement first-order PGD we do not get results of the expected quality, because of some known drawbacks of first-order PGD; for instance, a large step size in the update leads to slow convergence. Hence we use second-order PGD, for which we compute the Hessian matrix of \({\mathcal {L}}^{(t)}\) wrt \(\mathbf {U}^{(t)}\),

$$\begin{aligned} {\mathcal {H}}_{\mathbf {U}^{(t)}} \left( {\mathcal {L}}^{(t)}(\mathbf {U}^{(t)})\right) = 2 \sum _{s=1}^{t} \left[ \mu \cdot \gamma _0 \cdot {\mathbf {U}_c^{(s)}}^T \mathbf {U}_c^{(s)} + {\gamma }_s \pmb {v}^{(s)} {\pmb {v}^{(s)}}^T \right] \end{aligned}$$
(9)

Finally, we adopt the following update rule for the second-order PGD, which can guarantee faster convergence without requiring a step-size parameter:

$$\begin{aligned} \mathbf {U}_{k+1}^{(t)} = {\mathcal {P}}\left[ \mathbf {U}_{k}^{(t)} - \nabla _{\mathbf {U}^{(t)}} \left( {\mathcal {L}}^{(t)} (\mathbf {U}_{k}^{(t)})\right) {\mathcal {H}}^{-1}_{\mathbf {U}^{(t)}} \left( {\mathcal {L}}^{(t)}(\mathbf {U}_{k}^{(t)})\right) \right] \end{aligned}$$
(10)

where \({\mathcal {H}}^{-1}\) is the inverse of the Hessian matrix \({\mathcal {H}}\). As computing \({\mathcal {H}}^{-1}\) is time consuming, we use the Conjugate Gradient method to solve the corresponding linear system instead. The second-order PGD is shown in Algorithm 1. For notational convenience we introduce the following first-order and second-order terms respectively.

$$\begin{aligned}&\mathbf {W}^{(t)} = \sum _{s=1}^{t} \left[ {\gamma }_0 \cdot \mu \cdot \pmb {R}^{(s)} {\mathbf {U}_c^{(s)}} + {\gamma }_s \cdot \pmb {x}^{(s)} {\pmb {v}^{(s)}}^T \right] \end{aligned}$$
(11)
$$\begin{aligned}&\mathbf {H}^{(t)} = \sum _{s=1}^{t} \left[ {\gamma }_0 \cdot \mu \cdot {\mathbf {U}_c^{(s)}}^T \mathbf {U}_c^{(s)} + {\gamma }_s \cdot \pmb {v}^{(s)} {\pmb {v}^{(s)}}^T \right] \end{aligned}$$
(12)
Algorithm 1. Second-order PGD for updating the term-topic matrix \(\mathbf {U}^{(t)}\).
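A minimal sketch of this update, written in terms of the accumulated statistics \(\mathbf {W}^{(t)}\) and \(\mathbf {H}^{(t)}\) of Eqs. 11 and 12, is given below. The \(K \times K\) system is solved with conjugate gradient instead of forming the explicit inverse; the fixed iteration count and the small ridge added for conditioning are illustrative assumptions, not part of Algorithm 1.

```python
import numpy as np
from scipy.sparse.linalg import cg

def update_topics(U, W, H, n_iter=20, ridge=1e-8):
    """Second-order PGD update of Eq. 10 using the statistics W (m x K) and H (K x K)."""
    K = H.shape[0]
    Hreg = H + ridge * np.eye(K)              # keep the K x K system well conditioned
    for _ in range(n_iter):
        G = U @ Hreg - W                      # gradient of Eq. 8 expressed through W and H
        # step = G @ H^{-1}, obtained by solving Hreg x = g with conjugate gradient, row by row
        step = np.stack([cg(Hreg, g)[0] for g in G])
        U = np.maximum(U - step, 0.0)         # projection onto the non-negative orthant
    return U
```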

3.5 Online Topic Discovery

Using second-order PGD we can design an online topic discovery algorithm for short texts. The procedure can be performed using one pass or multiple passes over the data. The complete one-pass algorithm is given in Algorithm 2.

This algorithm follows a mini-batch implementation [4], which lies at the confluence of Stochastic Gradient Descent and traditional batch descent algorithms. As the algorithm imports p data points at each step, OTDA can be expected to converge faster. Consequently, the update rules for \(\mathbf {W}^{(t)}\) and \(\mathbf {H}^{(t)}\) are given by,

$$\begin{aligned}&\mathbf {W}^{(t)} = \mathbf {W}^{(t-1)} + \sum _{i=1}^{p} \left[ {\gamma }_0 \cdot \mu \cdot \pmb {R}^{(t,i)} {\mathbf {U}_c^{(t,i)}} + {\gamma }_{t} \cdot \pmb {x}^{(t,i)} {\pmb {v}^{(t,i)}}^T \right] \end{aligned}$$
(13)
$$\begin{aligned}&\mathbf {H}^{(t)} = \mathbf {H}^{(t-1)} + \sum _{i=1}^{p} \left[ {\gamma }_0 \cdot \mu \cdot {\mathbf {U}_c^{(t,i)}}^T \mathbf {U}_c^{(t,i)} + {\gamma }_t \cdot \pmb {v}^{(t,i)} {\pmb {v}^{(t,i)}}^T \right] \end{aligned}$$
(14)

Notice that we do not recompute \(\mathbf {W}^{(t)}\) and \(\mathbf {H}^{(t)}\) afresh each time; rather, we update them using Eqs. 13 and 14 respectively. Although only a single pass over the data may be feasible in data stream applications, multiple passes can be run in many settings. In the multi-pass OTDA, the document-topic assignment matrix \(\mathbf {V}\) can be updated using the term-topic matrix \(\mathbf {U}\). Moreover, the first- and second-order information \(\mathbf {W}\) and \(\mathbf {H}\) from the previous pass can be updated and utilized. When multiple passes are feasible one can expect to obtain more accurate results.

Algorithm 2. One-pass OTDA for streaming short texts.
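The following high-level sketch of the one-pass loop composes the helper sketches from the previous subsections (term_context_matrix, solve_document, solve_context, update_topics), which are assumed to be in scope; the batch layout, hyper-parameters and the handling of the current-batch weight are simplifications of Algorithm 2, not the authors' code.

```python
import numpy as np

def otda_one_pass(batches, vocab, K, mu=1.0, gamma0=0.5, gammaf=0.9, seed=0):
    """One pass over the stream; each batch supplies a dense term-document matrix X (m x p)
    and the raw token lists `docs` used to build that batch's term-context matrix."""
    rng = np.random.default_rng(seed)
    m = len(vocab)
    U = rng.random((m, K))                    # initial term-topic matrix
    W = np.zeros((m, K))                      # first-order statistics, Eq. 13
    H = np.zeros((K, K))                      # second-order statistics, Eq. 14
    V = []                                    # document-topic rows, collected as they are computed
    for t, batch in enumerate(batches, start=1):
        X = batch["X"]
        R = term_context_matrix(batch["docs"], vocab)
        Uc = solve_context(U, R)              # Eq. 6, with the current U
        gamma_t = gammaf                      # weight of the current batch (simplified decay handling)
        for i in range(X.shape[1]):
            v = solve_document(U, X[:, i])    # Eq. 5
            V.append(v)
            # Eqs. 13-14: the context term is added once per data point of the batch
            W += gamma0 * mu * (R @ Uc) + gamma_t * np.outer(X[:, i], v)
            H += gamma0 * mu * (Uc.T @ Uc) + gamma_t * np.outer(v, v)
        U = update_topics(U, W, H)            # second-order PGD step of Eq. 10
    return U, np.array(V)
```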

3.6 Computational Savings

As OTDA proceeds by solving Eqs. 5, 6 and 7, it incurs computational costs of O(mnK), O(mmK) and O(nmK) at these steps respectively. However, since \(\mathbf {X}\) and \(\mathbf {R}\) are sparse matrices, we only need to multiply the non-zero elements with the factor matrices. Hence the costs of these operations become \(O(z_{\mathbf {X}}K)\), \(O(z_{\mathbf {R}}K)\) and O(zK) for a single iteration, where \(z = \max {(z_{\mathbf {R}},z_{\mathbf {X}})}\). The proposed OTDA therefore has a cost of O(zK) per iteration. The Frobenius loss of OTDA is frequently very close to the Frobenius loss of the SeaNMF algorithm after \(T \le 2\) iterations, as witnessed in our experimentation, which saves computational cost appreciably (\(\approx O(zK)\) cost only). Also, our one-pass OTDA needs to load the data matrix only once, which involves low IO cost. Our experimental results show that we often do not need many passes to obtain very accurate results.

3.7 Topic Detection and Tracking

Our dynamic topic model enables capturing topics and their evolution over time. The column vector \(\mathbf {U}^{(t)}_{\cdot k}\) portrays topic k at time t. As each topic is represented as a column vector, i.e., a weighted combination of terms, the dissimilarity between the representations of topic k at time points \(t+1\) and t is defined as \(\mathrm{Dist}(k,t) = \left\Vert \mathbf {U}^{(t+1)}_{\cdot k} - \mathbf {U}^{(t)}_{\cdot k}\right\Vert _2\). We consider a topic to be emerging if it is different from its peers in the same stream, or from all the topics seen so far. The identification of emerging topics can be modeled by considering the K topic distances computed at time t together with a confidence level \(\mathrm{CL}\). We then use the algorithm in [1] (Sect. 3.2) to nominate emerging topics, in which the function \(\mathrm{Edetect(CL)}\) returns the emerging topics \(\mathrm{Etopics}(t)\) generated between time slices t and \(t+1\).
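A minimal sketch of the drift score and a simple thresholding stand-in for Edetect(CL) is shown below; the quantile-based threshold is an illustrative assumption, whereas the actual nomination procedure follows [1].

```python
import numpy as np

def emerging_topics(U_prev, U_curr, cl=0.90):
    """Flag topics whose drift Dist(k, t) exceeds the CL-quantile of the K distances."""
    dist = np.linalg.norm(U_curr - U_prev, axis=0)   # Dist(k, t) for every topic column k
    threshold = np.quantile(dist, cl)
    emerging = [k for k in range(U_curr.shape[1]) if dist[k] >= threshold]
    return emerging, dist
```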

4 Data Sets for Experimentation

We consider four short text data sets for our experiments, three of which are public and the fourth an internal data set. The public data sets are Yahoo manner, SearchSnippets and StackOverflow. The Yahoo manner data set (Yahoo) is a subset of the Yahoo Answers Manner Questions, version 2.04. The SearchSnippets (Snippets) data set was selected by searching web transactions using predefined phrases from 8 different domains. StackOverflow (Stack) is the challenge data set published online. The fourth data set (Optum) contains feedback texts provided by customers (from an offshore center of Optum) in certain healthcare domains. The three public data sets are labeled with categories, for which we generate the same number of topics. For the Optum feedback texts we assume 9 topics, using the standard criterion for selecting the optimal number of clusters.

Table 1. Statistics of data sets considered

Some basic statistics of these data sets are shown in Table 1. '#docs' is the number of documents in each data set, and '#terms' the number of terms in the vocabulary. The quantity 'density' is defined as \(\dfrac{\#\text{ non-zero }}{\#\text{ docs }\cdot \#\text{ terms }}\), where #non-zero is the number of non-zero elements in the matrix. The entries 'density(\(\mathbf {X}\))' and 'density(\(\mathbf {R}\))' give the density of the term-document matrix \(\mathbf {X}\) and the term-context correlation matrix \(\mathbf {R}\), respectively. 'doc-length' is the average length of the documents, and '#cats' the number of distinct categories.

5 Evaluation Metrics

We evaluate our approach by comparing the performance of our online topic discovery algorithm with other relevant algorithms on three characteristics: average Frobenius loss [14, 21], Topic Coherence (Coherence) [17] and Normalized Mutual Information (NMI) [7]. As a topic can be related to a cluster, we use the cluster-related metric NMI to measure the efficacy of our method, especially on labeled data; consequently, NMI values cannot be computed for the Optum data set, which is unlabeled.

For the comparison with our OTDA on average Frobenius loss, we use the work on clustering using online NMF due to Wang et al. [21] (ClusterONMF). There is an older work on online NMF for latent factor tracking due to Cao et al. [6] (LatentONMF); however, ClusterONMF has been shown to perform better than LatentONMF in terms of average Frobenius loss [21], and hence we do not consider LatentONMF in our experimentation. When we compare our OTDA using Coherence and NMI we use three baseline methods other than ClusterONMF: adaptive Online-LDA (A-OLDA) [1], Online Learning for LDA (L-OLDA) [9] and the Dynamic Topic Model (DTM) [2].

6 Experimental Results

We present experimental results on the data sets discussed above. For the benefit of reproducible research we have uploaded all our code and the baseline methods to https://github.com/varma-ds/OTDA. We tweaked the parameters appearing in the loss functions of Sect. 3, but they do not have much effect on the results, so we use default hyperparameter settings for each of the baselines. We use Scikit-learn's online LDA implementation for L-OLDA and Gensim's LdaSeqModel implementation for DTM. For L-OLDA and A-OLDA we set the document-topic prior to 1/K.

6.1 OTDA with Conjugate Gradient

We mainly focus on OTDA with the second-order method using conjugate gradient. The performance of OTDA with first-order PGD is not satisfactory and hence is not presented (due to space constraints).

Fig. 1. (a) Average Frobenius loss for one pass and two passes with increasing number of batches; (b) average Frobenius loss with increasing number of passes on the Yahoo data set.

For the experiments below, we assume that the data is divided into batches of some fixed size. For each batch, both the term-document matrix and the word-context correlation matrix are generated using a fixed vocabulary. For computing the word-context correlation matrix, each short text is considered a context; using this information the word-context correlation matrix is updated for each batch. Other context information, such as a fixed-size window of words or streaming text clusters [25], can also be considered.

Fig. 2. NMI measured at the end of each pass on the Yahoo data.

We present the results for the average Frobenius loss for one pass and two passes with an increasing number of batches. For the second pass we compute the average Frobenius loss using all n data points; for the first pass, it is calculated only over the data points seen so far. From Fig. 1(a) we see that with only one pass of the algorithm the average Frobenius loss increases at first and then starts decreasing as the number of batches increases. If two passes are allowed the average Frobenius loss remains almost constant, and it is smaller than the values of the first pass, which learns the topics from the initial batches only. All the data sets show a similar pattern. The results in Fig. 1(b) indicate that the average Frobenius loss continues to decrease, with diminishing returns, as we increase the number of passes, for almost all data sets. Here we reproduce the results for only one initialization and omit the results for other initializations due to space constraints. We show results for the Yahoo data set as the other data sets exhibit similar patterns.

We further compute the NMI values for the labeled data sets using OTDA and plot the results for the Yahoo data set in Fig. 2. NMI continues to go up as we increase the number of passes, up to a point, and then stabilizes.

6.2 Comparison with Online Methods

We now compare the performance of OTDA with ClusterONMF [21], A-OLDA [1], L-OLDA [9] and DTM [2] using the metrics Topic Coherence and NMI. Additionally, we compare the Frobenius loss of OTDA and ClusterONMF during learning. The best score achieved by each model on each data set is reported in Table 2.

Table 2. Performance of OTDA against baselines. (Best scores across different batch sizes and number of passes are chosen for each model)

Average Frobenius Loss. We compare our OTDA with ClusterONMF in terms of the average Frobenius loss (using Eq. 4.22 in [6]). We report the result for only one initialization and different batch sizes. Further, we produce results only for one pass of the algorithm, for obvious reasons. We also compute the Frobenius loss given by Eq. 4.20 in [6] (at the final iteration) for each batch for the method of Lee and Seung, shown as a dashed line (labeled L-S) in Fig. 3, where we report the average Frobenius loss of the 3 methods on the Yahoo data set only. Similar behavior is observed on the other 3 data sets; their description is omitted due to space constraints. In all cases OTDA produces a higher loss than ClusterONMF. This is expected, as we minimize the Frobenius loss together with another term involving context information, which acts like a regularization term (see Eq. 7).

Fig. 3. Comparison of average Frobenius loss on Yahoo.

Topic Coherence. We compute Topic Coherence for all the data sets, as shown in Table 2. On all of them OTDA performs better than the other baselines. While appreciable improvements in Coherence are observed for OTDA on the Stack and Optum data sets, the Snippets data set shows only a marginal gain with both OTDA and DTM. Further, in Fig. 4(a) we observe that with increasing batch size the Topic Coherence values reduce, as the models tend to assign more diverse and less coherent words to topics.

NMI. Quantitative evaluation using the NMI metric is conducted on the three data sets with label information, i.e., Yahoo, Snippets and Stack, which all have the same number of clusters, equal to 8. Table 2 depicts the clustering comparison for each method on the three labeled data sets. Overall, OTDA always outperforms ClusterONMF in terms of NMI. For the Yahoo and Snippets data, OTDA shows an improvement of 5-8% in NMI values in comparison to ClusterONMF. On the other hand, DTM performs slightly better than OTDA on the Snippets data set. Figure 4(b) shows the comparison of the different models on the Yahoo data set, where we observe competitive performance between OTDA and ClusterONMF with increasing batch size.

Fig. 4. Comparison of different models w.r.t. Topic Coherence and NMI on the Yahoo data.

6.3 Effect of Decay on Streaming Data

We now examine the effect of the decay factors on streaming data. For that we curate the Yahoo data set as follows. We divide the data set into 4 groups and 2 types, characterized by the categories; that is, each type contains exactly 4 distinct categories of data. Details of this curated data are shown in Table 3. We assume each group corresponds to one batch and that data arrives in batches. Using this curated data we experiment with and without decay factors in the OTDA formulation. The results are presented in Fig. 5: in the absence of decay factors, NMI drops when a new type of data arrives; with the introduction of decay factors, OTDA is able to forget the past topic distributions and learn the new ones.

Table 3. Curated Yahoo data
Fig. 5. Decay effect on NMI for the Yahoo data.

Fig. 6. Probability distribution and distance of the topic Maths across different batch numbers (trending regions are highlighted).

6.4 Emerging Topic Detection

To test the ability of OTDA to detect novel topics as they evolve, we create synthesized data by mixing the Yahoo and StackOverflow data sets, from which we take 10 categories (all the categories from Yahoo and only 2 categories from StackOverflow), in the following manner: (1) we add 9 categories in equal proportions (i.e.p.), excluding the topic Maths; (2) we add all 10 categories i.e.p., including Maths; (3) we repeat steps 1 and 2 four times; and (4) at the 9th time instant we add 9 categories i.e.p., excluding Maths. With this synthesized data, we are able to detect the topic Maths as an emerging topic at the 2nd, 4th, 6th and 8th time instants at the 90% confidence level (Fig. 6). The detected topic probability distribution of Maths is also presented in Fig. 6.

7 Conclusion

We have proposed an efficient online NMF algorithm for discovering topics from short texts which processes incoming data incrementally. There are several reasons for choosing NMF over LDA to design this online algorithm.

While our method optimizes the loss function directly, other (LDA-based) online topic discovery algorithms that use variational inference techniques produce approximations of the actual results. Further, Markov chain Monte Carlo-based topic extraction methods (e.g., for LDA) are asymptotically exact but computationally expensive. This makes our model a good fit as an accurate as well as fast and scalable alternative to other topic models.