
1 Introduction

Topic modeling is a rapidly developing branch of statistical text analysis [1]. A topic model reveals the hidden thematic structure of a text collection and provides a compressed representation of each document in terms of its topics. Practical applications of topic models span many areas, such as information retrieval for long-text queries, classification, categorization, summarization and segmentation of texts. Topic models are increasingly applied to non-textual and heterogeneous data including signals, images, video and networks. More ideas, models and applications are outlined in the survey [4].

From a statistical point of view, a probabilistic topic model (PTM) defines each topic by a multinomial distribution over words, and then describes each document with a multinomial distribution over topics. From an optimization point of view, topic modeling can be considered a special case of approximate stochastic matrix factorization. Learning a factorized representation of a text collection is an ill-posed problem with an infinite set of solutions. A typical remedy is to apply regularization techniques, which impose problem-specific constraints and ultimately lead to a better solution.

Modern literature on topic modeling offers hundreds of models adapted to different situations. Nevertheless, most of these models are too difficult for practitioners to quickly understand, adapt and embed into applications. This leads to a common practice of using only the basic, out-of-date models such as Probabilistic Latent Semantic Analysis (PLSA) [6] and Latent Dirichlet Allocation (LDA) [3]. Most practical inconveniences are rooted in Bayesian learning, which is the dominant approach in topic modeling. Bayesian inference of topic models requires laborious mathematical work, which hinders flexible unification, modification, selection, and combination of topic models.

In this paper we announce BigARTM, an open source project for regularized multimodal topic modeling of large collections, http://bigartm.org. The theory behind BigARTM is based on a non-Bayesian multicriteria approach, Additive Regularization of Topic Models (ARTM) [11]. In ARTM a topic model is learned by maximizing a weighted sum of the log-likelihood and additional regularization criteria. The optimization problem is solved by a general regularized expectation-maximization (EM) algorithm, which can be applied to an arbitrary combination of regularization criteria. Many known Bayesian topic models have been revisited in terms of ARTM in [12, 13]. Compared to the Bayesian approach, ARTM makes it easier to design, infer and combine topic models, thus lowering the barrier to entry into the topic modeling research field.

The BigARTM source code is released under the New BSD License, which permits free commercial and non-commercial use. The core of the library is written in C++ and is exposed via two equally rich APIs for C++ and Python. The library is cross-platform and can be built for Linux, Windows and OS X in both 32-bit and 64-bit configurations. In our experiments on the Wikipedia corpus BigARTM outperforms the Vowpal Wabbit LDA and Gensim libraries in terms of perplexity and runtime. Compared to these libraries, BigARTM offers several additional features, such as regularization and multimodal topic modeling.

The rest of the paper is organized as follows. In Sect. 2 we introduce a multimodal topic model for documents with metadata. In Sect. 3 we generalize the fast online algorithm [5] to multimodal ARTM. In Sect. 4 we describe parallel architecture and implementation details of the BigARTM library. In Sect. 5 we report results of our experiments on large datasets. In Sect. 6 we discuss advantages, limitations and open problems of BigARTM.

2 Multimodal Regularized Topic Model

Let D denote a finite set (collection) of texts and \(W^1\) denote a finite set (vocabulary) of all terms from these texts. Each term can represent a single word or a key phrase. A document can contain not only words, but also terms of other modalities. Each modality is defined by a finite set (vocabulary) of terms \(W^m\), \({m=1,\ldots ,M}\). Examples of non-word modalities are: authors, class or category labels, date-time stamps, references to/from other documents, entities mentioned in texts, objects found in the images associated with the documents, users that read or downloaded documents, advertising banners, etc.

Assume that each term occurrence in each document refers to some latent topic from a finite set of topics T. The text collection is considered to be a sample of triples \((w_i,d_i,t_i)\), \({i=1,\ldots ,n}\), drawn independently from a discrete distribution p(wdt) over the finite space \(W\times D \times T\), where \({W=W^1\sqcup \cdots \sqcup W^M}\) is the disjoint union of the vocabularies across all modalities. Terms \(w_i\) and documents \(d_i\) are observable variables, while topics \(t_i\) are latent variables.

Following the idea of Correspondence LDA [2] and Dependency LDA [9] we introduce a topic model for each modality:

$$\begin{aligned} p(w\,{|}\,d) = \sum _{t\in T} p(w\,{|}\,t)\, p(t\,{|}\,d) = \sum _{t\in T} \phi _{wt} \theta _{td}, \quad d\in D,\; w\in W^m,\; m=1,\ldots ,M. \end{aligned}$$

The parameters \({\theta _{td}=p(t\,{|}\,d)}\) and \({\phi _{wt}=p(w\,{|}\,t)}\) form matrices \({\varTheta = \bigl ( \theta _{td} \bigr )_{T\times D}}\) of topic probabilities for the documents, and \({\varPhi^m = \bigl ( \phi _{wt} \bigr )_{W^m\times T}}\) of term probabilities for the topics. The matrices \(\varPhi^m\), if stacked vertically, form a \({W\!\!\times \!T}\)-matrix \(\varPhi\). Matrices \(\varPhi^m\) and \(\varTheta \) are stochastic, that is, their vector-columns represent discrete distributions. Usually |T| is much smaller than |D| and |W|.
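
To make the factorization concrete, the following minimal NumPy sketch (illustrative only, not BigARTM code; the sizes, the random initialization, and the second "author labels" modality are assumptions) stacks per-modality matrices \(\varPhi^m\) into \(\varPhi\), checks that \(\varPhi^m\) and \(\varTheta \) are column-stochastic, and computes \(p(w\,{|}\,d)\) for all modalities at once:

# Illustrative sketch (not BigARTM code): per-modality Phi^m matrices stacked
# vertically into Phi, and p(w|d) computed as the matrix product Phi * Theta.
import numpy as np

rng = np.random.default_rng(0)
T, D = 10, 5                       # number of topics and documents
W_sizes = [1000, 50]               # |W^1| words, |W^2| e.g. author labels (assumed)

# Random column-stochastic Phi^m (|W^m| x T) and Theta (T x D)
Phis = [rng.dirichlet(np.ones(Wm), size=T).T for Wm in W_sizes]
Theta = rng.dirichlet(np.ones(T), size=D).T

Phi = np.vstack(Phis)              # (|W^1| + |W^2|) x T, vertical stacking
assert np.allclose(Phis[0].sum(axis=0), 1.0)     # each Phi^m is stochastic
assert np.allclose(Theta.sum(axis=0), 1.0)       # and so is Theta

P = Phi @ Theta                    # p(w|d) for all modalities at once
print(P.shape)                     # (1050, 5)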

To learn parameters \(\varPhi^m\), \(\varTheta \) from the multimodal text collection we maximize the log-likelihood for each m-th modality:

$$\begin{aligned} \mathscr {L}_m (\varPhi^m,\varTheta ) = \sum _{d\in D}\sum _{w\in W^m} n_{dw} \ln p(w\,{|}\,d) \rightarrow \max _{\varPhi^m,\varTheta }, \end{aligned}$$

where \(n_{dw}\) is the number of occurrences of the term \(w\in W^m\) in the document d. Note that topic distributions of documents \(\varTheta \) are common for all modalities. Following the ARTM approach, we add a regularization penalty term \(R(\varPhi,\varTheta )\) and solve a constrained multicriteria optimization problem via scalarization:

$$\begin{aligned}&\qquad \sum _{m=1}^M \tau _m \mathscr {L}_m (\varPhi^m,\varTheta ) + R(\varPhi,\varTheta ) \rightarrow \max _{\varPhi,\varTheta }; \end{aligned}$$
(1)
$$\begin{aligned}&\sum _{w\in W^m}\!\!\! \phi _{wt} = 1,~ \phi _{wt}\geqslant 0; \qquad \sum _{t\in T} \theta _{td} = 1,~ \theta _{td}\geqslant 0. \end{aligned}$$
(2)

The local maximum \((\varPhi,\varTheta )\) of the problem (1), (2) satisfies the following system of equations with auxiliary variables \(p_{tdw} = p(t\,{|}\,d,w)\):

$$\begin{aligned} p_{tdw}&= \mathop {\text {norm}}\limits _{t\in T} \bigl (\phi _{wt}\theta _{td}\bigr ); \end{aligned}$$
(3)
$$\begin{aligned} \phi _{wt}&= \mathop {\text {norm}}\limits _{w\in W^m} \biggl ( n_{wt} + \phi _{wt} \frac{\partial R}{\partial \phi _{wt}} \biggr ); \quad n_{wt} = \sum _{d\in D} n_{dw} p_{tdw}; \end{aligned}$$
(4)
$$\begin{aligned} \theta _{td}&= \mathop {\text {norm}}\limits _{t\in T} \biggl ( n_{td} + \theta _{td} \frac{\partial R}{\partial \theta _{td}} \biggr ); \quad n_{td} = \sum _{w\in d} \tau _{m(w)} n_{dw} p_{tdw}; \end{aligned}$$
(5)

where the operator \(\mathop {\text {norm}}\limits _{t\in T} x_t = \frac{\max \{x_t,0\}}{\sum \limits _{s\in T} \max \{x_s,0\}}\) transforms a vector \((x_t)_{t\in T}\) into a discrete distribution; m(w) is the modality of the term w, so that \(w\in W^{m(w)}\).

The system of Eqs. (3)–(5) follows from the Karush–Kuhn–Tucker conditions. It can be solved by various numerical methods. In particular, the simple-iteration method is equivalent to the EM algorithm, which is typically used in practice. For a single modality (\({M=1}\)) it yields the regularized EM algorithm proposed in [11]. With no regularization (\({R=0}\)) it corresponds to PLSA [6].
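
For illustration, the following single-modality sketch (a simplification with dense matrices, not the BigARTM implementation) performs one simple-iteration step of (3)–(5); the regularizer enters only through the gradient arrays, which default to zero and thus reproduce PLSA:

# Single-modality sketch of the EM updates (3)-(5); illustrative, not BigARTM code.
# n_dw: |D| x |W| term counts; Phi: |W| x |T|; Theta: |T| x |D|.
import numpy as np

def norm(x, axis):
    """The 'norm' operator: clip negative entries to zero and normalize along `axis`."""
    x = np.maximum(x, 0.0)
    s = x.sum(axis=axis, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def em_iteration(n_dw, Phi, Theta, dR_dPhi=0.0, dR_dTheta=0.0):
    """One simple-iteration step of (3)-(5) with M = 1 and tau_1 = 1."""
    n_dw = np.asarray(n_dw, dtype=float)
    P = Phi @ Theta                               # p(w|d), shape |W| x |D|
    Z = np.divide(n_dw.T, P, out=np.zeros_like(P), where=P > 0)
    n_wt = Phi * (Z @ Theta.T)                    # sums n_dw * p_tdw over d, using (3)
    n_td = Theta * (Phi.T @ Z)                    # sums n_dw * p_tdw over w, using (3)
    Phi_new   = norm(n_wt + Phi * dR_dPhi, axis=0)      # update (4)
    Theta_new = norm(n_td + Theta * dR_dTheta, axis=0)  # update (5)
    return Phi_new, Theta_new

In the multimodal case the only changes are a separate normalization of \(\varPhi^m\) within each vocabulary \(W^m\) and the weights \(\tau_{m(w)}\) in \(n_{td}\).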

Many Bayesian topic models can be considered as special cases of ARTM with different regularizers R, as shown in [12, 13]. For example, LDA [3] corresponds to the entropy smoothing regularizer.
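
Concretely, following [12, 13], the smoothing regularizer can be written (up to constant coefficients) as

$$\begin{aligned} R(\varPhi,\varTheta ) = \sum _{t\in T}\sum _{w\in W} \beta _w \ln \phi _{wt} + \sum _{d\in D}\sum _{t\in T} \alpha _t \ln \theta _{td}, \end{aligned}$$

so that \(\phi _{wt}\frac{\partial R}{\partial \phi _{wt}} = \beta _w\) and \(\theta _{td}\frac{\partial R}{\partial \theta _{td}} = \alpha _t\), and the updates (4), (5) turn into the familiar smoothed estimates \(\phi _{wt} = \mathop {\text {norm}}\limits _{w\in W}(n_{wt}+\beta _w)\) and \(\theta _{td} = \mathop {\text {norm}}\limits _{t\in T}(n_{td}+\alpha _t)\).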

Due to the unified framework of additive regularization, BigARTM can build topic models for various applications simply by choosing a suitable combination of regularizers from a built-in, user-extendable library.

3 Online Topic Modeling

Following the idea of Online LDA [5], we split the collection D into batches \(D_b\), \({b=1,\ldots ,B}\), and organize the EM iterations so that each document vector \(\theta _d\) is iterated until convergence at a constant matrix \(\varPhi\), see Algorithms 1 and 2. The matrix \(\varPhi\) is updated rarely, after all documents from a batch have been processed. For a large collection the matrix \(\varPhi\) often stabilizes after a small initial part of the collection. Therefore a single pass through the collection might be sufficient to learn a topic model.

Algorithm 1 does not specify how often to synchronize the \(\varPhi\) matrix at steps 5–8. It can be done after every batch or less frequently (for instance, if \(\frac{\partial R}{\partial \phi _{wt}}\) takes a long time to evaluate). This flexibility is especially important for a concurrent implementation of the algorithm, where multiple batches are processed in parallel. In this case synchronization can be triggered when a fixed number of documents has been processed since the last synchronization.

The online reorganization of the EM iterations is not necessarily tied to the Bayesian inference used in [5]. Different topic models, from PLSA to multimodal and regularized models, can be learned by the above online EM algorithm.
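
A minimal sketch of the outer online loop is given below (illustrative only; Algorithm 1 specifies the exact bookkeeping, and the process_batch routine and synchronization policy here are simplified assumptions):

# Illustrative outer loop of the online EM scheme (not BigARTM's Algorithm 1).
# process_batch(batch, Phi) is assumed to iterate theta_d to convergence at a
# fixed Phi and return the accumulated n_wt counts for the batch (cf. Algorithm 2).
import numpy as np

def norm(x, axis):
    x = np.maximum(x, 0.0)
    s = x.sum(axis=axis, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def online_em(batches, Phi, process_batch, sync_every=1, dR_dPhi=0.0):
    n_wt = np.zeros_like(Phi)
    for b, batch in enumerate(batches, start=1):
        n_wt += process_batch(batch, Phi)            # Theta of the batch is discarded
        if b % sync_every == 0:                      # synchronize Phi (cf. steps 5-8)
            Phi = norm(n_wt + Phi * dR_dPhi, axis=0)
    return Phi

Setting sync_every greater than one makes the synchronization less frequent, which is exactly the flexibility discussed above.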

4 BigARTM Architecture

The main goal of the BigARTM architecture is to keep memory usage constant regardless of the collection size. For this reason each batch \(D_b\) is stored on disk in a separate file, only a limited number of batches is loaded into main memory at any given time, and the entire \(\varTheta \) matrix is never stored in memory.

Algorithm 1. Online EM algorithm for multimodal ARTM
Algorithm 2. The ProcessBatch routine

Concurrency. A general rule of concurrent design is to express parallelism at the highest possible level. For this reason BigARTM processes batches concurrently and keeps the \(\textsf {ProcessBatch}(D_b, \phi _{wt})\) routine single-threaded.

Splitting the collection into batches and processing them concurrently is a common approach, introduced in the AD-LDA algorithm [8] and further developed in the PLDA [15] and PLDA+ [7] algorithms. These algorithms require all concurrent workers to become idle before an update of the \(\varPhi\) matrix. Such a synchronization step adds a large overhead in the online algorithm, where the \(\varPhi\) matrix is updated multiple times during each iteration. An alternative architecture without the synchronization step is described in [10]; however, it mostly targets a distributed cluster environment. In our work we develop an efficient single-node architecture in which all workers benefit from the shared memory space.

Fig. 1. Diagram of key BigARTM components

To run multiple \(\textsf {ProcessBatch}\) calls in parallel, the inputs and outputs of this routine are stored in two separate in-memory queues, locked for push and pop operations with spin locks (Fig. 1). This approach does not add any noticeable synchronization overhead because both queues store only smart pointers to the actual data objects, so push and pop operations do not involve copying or relocating large objects in memory.

Smart pointers are also essential for the lifecycle of the \(\varPhi\) matrix. This matrix is read by all processor threads and can be written at any time by the merger thread. To update \(\varPhi\) without pausing all processor threads we keep two copies: an active and a background \(\varPhi\) matrix. The active matrix is read-only and is used by the processor threads. The background matrix is built in the background by the merger thread at steps 6 and 7 of Algorithm 1, and once it is ready the merger thread marks it as active. Before processing a new batch the processor thread gets the current active matrix from the merger thread. This object is passed via a shared smart pointer to ensure that the processor thread keeps ownership of its \(\varPhi\) matrix until the batch is fully processed. As a result, all processor threads keep running concurrently with the update of the \(\varPhi\) matrix.

Note that all processor threads share the same \(\varPhi\) matrix, which means that memory usage stays at a constant level regardless of how many cores are used for computation. In our opinion, spending memory on two copies of the \(\varPhi\) matrix gives a reasonable balance between memory and CPU resources. An alternative solution with only one \(\varPhi\) matrix is also possible, but it would require heavy use of atomic CPU instructions. Such operations are very efficient, but still come at a considerable synchronization cost, and using them for all reads and writes of the \(\varPhi\) matrix would cause a significant performance degradation in the merger and processor threads. Besides, an arbitrary overlap between reads and writes of the \(\varPhi\) matrix eliminates any possibility of producing a deterministic result. The design with two copies of the \(\varPhi\) matrix gives much more control over this and in certain cases allows BigARTM to behave in a fully deterministic way.
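
A toy sketch of this double-buffering idea follows (written in Python purely for illustration; BigARTM implements it in C++ with shared smart pointers and spin locks, and the class and method names here are invented):

# Toy sketch of active/background Phi double-buffering; not BigARTM code.
import threading
import numpy as np

class PhiHolder:
    def __init__(self, phi):
        self._active = phi            # read-only copy used by processor threads
        self._lock = threading.Lock()

    def get_active(self):
        with self._lock:
            return self._active       # the caller keeps this reference alive for
                                      # the whole batch, like a shared smart pointer

    def publish(self, new_phi):
        with self._lock:
            self._active = new_phi    # merger thread swaps in the background copy

A processor thread calls get_active() once per batch and uses the returned matrix throughout, while the merger thread builds the background matrix and swaps it in with publish().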

The design with two \(\varPhi\) matrices supports only a single merger thread, and we believe it should be able to handle the \(\tilde{n}_{wt}\) updates coming from many processor threads. This is a reasonable assumption because merging at step 6 takes only about \(O(|W|\cdot |T|)\) operations, while \(\textsf {ProcessBatch}\) takes \(O(n |T| I)\) operations, where n is the number of non-zero entries in the batch and I is the average number of inner iterations in the \(\textsf {ProcessBatch}\) routine. The ratio n / |W| is typically between 100 and 1000 (based on datasets in the UCI Bag-Of-Words repository) and I is about \(10 \ldots 20\), so the work ratio \(nI/|W|\) safely exceeds the expected number of cores (up to 32 physical CPU cores in modern workstations, and even the 60 cores of the Intel Xeon Phi co-processors).

Data Layout. BigARTM uses dense single-precision matrices to represent \(\varPhi\) and \(\varTheta \). Together with the \(\varPhi\) matrix we store a global dictionary of all terms \({w \in W}\). This dictionary is implemented as an \(\textsf {std::unordered\_map}\) that maps the string representation of \({w \in W}\) to its integer index in the \(\varPhi\) matrix. The dictionary is extended automatically as more batches come through the system. To achieve this, each batch \(D_b\) contains a local dictionary \(W_b\) listing all terms that occur in the batch. The \(n_{dw}\) elements of the batch are stored as a sparse matrix in Compressed Sparse Row (CSR) format, where each row corresponds to a document \({d \in D_b}\) and the terms w run over the local batch dictionary \(W_b\).
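
A rough sketch of such a batch built with SciPy is shown below (illustrative; the real batches are protobuf files on disk with their own schema, and the toy documents here are invented):

# Sketch of a batch stored as a CSR matrix of n_dw counts plus a local dictionary.
from scipy.sparse import csr_matrix

local_dict = ["topic", "model", "bayesian"]          # W_b: terms occurring in the batch
docs = [{"topic": 3, "model": 1},                    # token counts per document
        {"model": 2, "bayesian": 5}]

rows, cols, vals = [], [], []
for d, counts in enumerate(docs):
    for w, n in counts.items():
        rows.append(d)
        cols.append(local_dict.index(w))
        vals.append(n)

n_dw = csr_matrix((vals, (rows, cols)), shape=(len(docs), len(local_dict)))
print(n_dw.toarray())                                # dense view of n_dw for the batch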

For performance reasons the \(\varPhi\) matrix is stored in column-major order and \(\varTheta \) in row-major order. This layout ensures that the sum \(\sum _t \phi_{wt} \theta _{td}\) runs over contiguous memory blocks. In both matrices all values smaller than \(10^{-16}\) are replaced with zero to avoid performance issues with denormalized numbers.

Programming Interface. All functionality of BigARTM is exposed through a set of \(\textsf {extern\;C}\) methods. To input and output complex data structures the API uses Google Protocol Buffers. This approach makes it easy to integrate BigARTM into any research or production environment, as almost every modern language has an implementation of Google Protocol Buffers and a way of calling \(\textsf {extern\;C}\) code (the \(\textsf {ctypes}\) module for Python, \(\textsf {loadlibrary}\) for Matlab, \(\textsf {PInvoke}\) for C#, etc.).
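
The general calling pattern looks roughly like the following sketch; every name in it (the library file, the function, the message type) is a placeholder and not the actual BigARTM API:

# Generic pattern for calling an extern "C" function that accepts a serialized
# protobuf message; all identifiers below are hypothetical placeholders.
import ctypes

lib = ctypes.CDLL("libexample.so")                       # hypothetical shared library
lib.ExampleConfigure.argtypes = [ctypes.c_char_p, ctypes.c_int64]
lib.ExampleConfigure.restype = ctypes.c_int32

def configure(message):
    """Serialize a protobuf message and hand it to the C API as (pointer, length)."""
    blob = message.SerializeToString()
    return lib.ExampleConfigure(blob, len(blob))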

On top of the \(\textsf {extern\;C}\) API BigARTM already has convenient wrappers in C++ and Python. We are also planning to implement a Java wrapper in the near future. In addition to the APIs, the library has a simple command-line interface.

BigARTM has built-in libraries of regularizers and quality measures that, in the current implementation, can be extended only by recompiling the project.

Basic Tools. A careful selection of programming tools is important for any software project. This is especially true for BigARTM, as its code is written in C++, a language that by itself offers less out-of-the-box functionality compared to Python, the .NET Framework or Java. To mitigate this we use various parts of the Boost C++ Libraries, Google Protocol Buffers for data serialization, the ZeroMQ library for network communication, and several other libraries.

BigARTM uses CMake as a cross-platform build system and successfully builds on Windows, Linux and OS X in 32-bit and 64-bit configurations. Building the library requires a recent C++ compiler with C++11 support (GNU GCC 4.6.3, clang 3.4, or Visual Studio 2012 or newer) and Boost Libraries 1.46.1 or newer. All other third-party dependencies are included in the BigARTM repository.

We also use free online services to store source code (https://github.com), to host online documentation (https://readthedocs.org) and to run automated continuous integration builds (http://travis-ci.org).

5 Experiments

In this section we evaluate the runtime performance and the algorithmic quality of BigARTM against two popular software packages, Gensim [14] and Vowpal Wabbit. We also demonstrate some unique BigARTM features, such as combining regularizers and multi-language topic modeling via multimodality, which are not available in the other packages.

All three libraries (VW.LDA, Gensim and BigARTM) work out-of-core, i.e. they are designed to process data that is too large to fit into a computer's main memory at one time. This allowed us to benchmark on a fairly large collection: 3.7 million articles from the English Wikipedia. The conversion to bag-of-words was done with the \(\textsf {gensim.make\_wikicorpus}\) script, which excludes all non-article pages (such as category, file, template, and user pages) as well as pages that contain less than 50 words. The dictionary is formed by all words that occur in at least 20 documents but in no more than \(10\,\%\) of the documents in the collection. The resulting dictionary was capped at the \(|W| = 100\,000\) most frequent words.
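
For reference, the same kind of vocabulary filtering can be expressed with gensim's Dictionary class (a sketch with toy input; the actual preprocessing used the make_wikicorpus script):

# Sketch of the dictionary filtering thresholds used above, via gensim.
from gensim.corpora import Dictionary

texts = [["some", "tokenized", "article"], ["another", "tokenized", "article"]]
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=20,      # keep words in at least 20 documents
                           no_above=0.10,    # ... but in no more than 10% of them
                           keep_n=100_000)   # cap at the 100 000 most frequent words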

Both Gensim and VW.LDA represent the resulting topic model as Dirichlet distributions over the columns of the \(\varPhi\) and \(\varTheta \) matrices: \({\mathbf {\theta }}_{d} \sim \text {Dir}({\mathbf {\gamma }}_d)\) and \({\mathbf {\phi }}_{t} \sim \text {Dir}({\mathbf {\lambda }}_t)\). In contrast, BigARTM outputs the matrices \(\varPhi\) and \(\varTheta \) directly as point estimates. To compare perplexity we take the mean or the mode of the posterior distributions:

$$\begin{aligned} \phi ^{\mathrm {mean}}_{wt}&= \mathop {\text {norm}}\limits _{w\in W} \lambda _{wt},&\theta ^{\mathrm {mean}}_{td}&= \mathop {\text {norm}}\limits _{t\in T} \gamma _{td}; \\ \phi ^{\mathrm {mode}}_{wt}&= \mathop {\text {norm}}\limits _{w\in W} (\lambda _{wt}-1),&\theta ^{\mathrm {mode}}_{td}&= \mathop {\text {norm}}\limits _{t\in T} (\gamma _{td}-1). \end{aligned}$$

The perplexity measure is defined as

$$\begin{aligned} \mathscr {P}(D, p) = \exp \biggl ( - \frac{1}{n} \sum _{d \in D} \sum _{w \in d} n_{dw} \ln p(w \,{|}\,d) \biggr ). \end{aligned}$$
(6)
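
A small sketch of the mean/mode conversion and of the perplexity (6) is given below (dense NumPy matrices assumed, illustrative only; the parameter matrices Lambda and Gamma stand for \((\lambda_{wt})\) and \((\gamma_{td})\)):

# Sketch: point estimates from Dirichlet parameters, and perplexity (6).
import numpy as np

def norm(x, axis):
    x = np.maximum(x, 0.0)
    s = x.sum(axis=axis, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def to_point_estimates(Lambda, Gamma, use_mode=False):
    """Lambda: |W| x |T| parameters of phi_t; Gamma: |T| x |D| parameters of theta_d."""
    shift = 1.0 if use_mode else 0.0
    Phi = norm(Lambda - shift, axis=0)
    Theta = norm(Gamma - shift, axis=0)
    return Phi, Theta

def perplexity(n_dw, Phi, Theta):
    """Perplexity (6) for a dense |D| x |W| count matrix n_dw."""
    n_dw = np.asarray(n_dw, dtype=float)
    P = np.clip(Phi @ Theta, 1e-300, None)       # p(w|d), |W| x |D|; guard log(0)
    return np.exp(-np.sum(n_dw.T * np.log(P)) / n_dw.sum())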

Comparison to Existing Software Packages. Vowpal Wabbit (VW) is a library of online algorithms that covers a wide range of machine learning problems. For topic modeling VW provides the VW.LDA algorithm, based on Online Variational Bayes LDA [5]. VW.LDA is neither multi-core nor distributed, but its efficient single-threaded C++ implementation makes it one of the fastest tools for topic modeling.

The Gensim library specifically targets topic modeling and matrix factorization. It has two LDA implementations, LdaModel and LdaMulticore, both based on the same algorithm as VW.LDA (Online Variational Bayes LDA [5]). Gensim is written entirely in Python. Its high performance is achieved through the NumPy library, built on top of low-level BLAS libraries (such as Intel MKL, ATLAS, or OpenBLAS). In LdaModel all batches are processed sequentially, and the concurrency happens entirely within NumPy. In LdaMulticore the workflow is similar to BigARTM: several batches are processed concurrently, and a single aggregation thread asynchronously merges the results.

Table 1. Comparison of BigARTM with VW.LDA and Gensim. Train time is the time for model training, inference is the time to calculate \(\theta _d\) for \(100\,000\) held-out documents, and perplexity is calculated according to (6) on held-out documents.
Fig. 2. Running BigARTM in parallel: speed up (left) and memory usage (right)

Each run in our experiment performs one pass over the Wikipedia corpus and produces a model with \(|T|=100\) topics. The runtime is reported for an Intel-based CPU with 16 physical cores with hyper-threading. The collection was split into batches of 10 000 documents each (chunksize in Gensim, minibatch in VW.LDA). The update rule in the online algorithm used \({\rho = (b + \tau _0)^{-0.5}}\), where b is the number of batches processed so far and \(\tau _0\) is a constant offset parameter introduced in [5]; in our experiment \({\tau _0 = 64}\). Updates were performed after each batch in non-parallel runs, and after P batches when running in P threads. The LDA priors were fixed as \({\alpha = 0.1}\), \({\beta = 0.1}\), so that \({\mathbf {\theta }}_d \sim \text {Dir}(\alpha )\), \({\mathbf {\phi }}_t \sim \text {Dir}(\beta )\).
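
For clarity, the following sketch shows how such a weight is typically applied in the online scheme of [5]; the exact decay rule inside each library may differ in details, so this is only an illustration of the roles of b and \(\tau_0\):

# Sketch of a rho-weighted online update in the style of [5]; illustrative only.
def rho(b, tau_0=64.0, kappa=0.5):
    """Step size after processing batch number b."""
    return (b + tau_0) ** (-kappa)

def blend(old, batch_estimate, b):
    """Convex combination of the old parameters and the estimate from batch b."""
    r = rho(b)
    return (1.0 - r) * old + r * batch_estimate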

Table 1 compares the performance of VW.LDA, Gensim LdaModel, Gensim LdaMulticore, and BigARTM. Figure 2 shows BigARTM speedup and memory consumption depending on the number of CPU threads, measured on an Amazon AWS c3.8xlarge instance with 32 virtual cores, with Gensim 0.10.3 under Python 2.7.

Experiments with Combination of Regularizers. BigARTM has a built-in library of regularizers, which can be used in any combination. In the following experiment we combine three regularizers: sparsing of \(\phi _{t}\) distributions, sparsing of \(\theta _{d}\) distributions, and pairwise decorrelation of \(\phi _{t}\) distributions. This combination helps to improve several quality measures without significant loss of perplexity, according to experiments on the offline implementation of ARTM [13]. The goal of our experiment is to show that this remains true for the online implementation in BigARTM. We use the following built-in quality measures: the hold-out perplexity, the sparsity of \(\varPhi\) and \(\varTheta \) matrices, and the characteristics of topic lexical kernels (size, purity, and contrast) averaged across all topics.
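
Written out in the form used in [13] (the coefficients below are illustrative; the actual values are configuration parameters), the combined regularizer is

$$\begin{aligned} R(\varPhi,\varTheta ) = -\beta \sum _{t\in T}\sum _{w\in W} \ln \phi _{wt} - \alpha \sum _{d\in D}\sum _{t\in T} \ln \theta _{td} - \gamma \sum _{t\in T}\sum _{s\in T\setminus \{t\}}\sum _{w\in W} \phi _{wt}\phi _{ws}, \end{aligned}$$

where the first two terms sparse the \(\phi _{t}\) and \(\theta _{d}\) distributions (smoothing with a negative sign) and the last term decorrelates the topics pairwise.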

Table 2. Comparison of LDA and ARTM models. Quality measures: \(\mathcal {P}_{10\,k}\) and \(\mathcal {P}_{100\,k}\) denote hold-out perplexity on 10 K and 100 K document sets, \(\mathcal {S}_{\varPhi}\) and \(\mathcal {S}_{\varTheta }\) denote sparsity of the \(\varPhi\) and \(\varTheta \) matrices (in %), and \(\mathcal {K}_{s}\), \(\mathcal {K}_{p}\), \(\mathcal {K}_{c}\) denote average topic kernel size, purity and contrast respectively.
Fig. 3. Comparison of LDA (thin) and ARTM (bold) models. The number of processed documents is shown along the X axis.

Table 2 compares the results of additive combination of regularizers (ARTM) and the usual LDA model. Figure 3 presents quality measures as functions of the number of processed documents. The left chart shows perplexity and sparsity of \(\varPhi\), \(\varTheta \) matrices, and the right chart shows average lexical kernel measures.

Experiments on Multi-language Wikipedia. To show how BigARTM works with multimodal datasets we prepared a text corpus containing all English and Russian Wikipedia articles connected by mutual interlanguage links. We represent each linked pair of articles as a single multi-language document with two modalities, one for each language. In this way the multi-language collection becomes a multimodal document collection.

The dump of Russian articles was processed with the same technique as the English Wikipedia above. Russian words were lemmatized with Yandex MyStem 3.0. To further reduce the dictionary we keep only words that appear in at least 20 documents but in no more than 10 % of the documents in the collection. The resulting collection contains 216 175 pairs of Russian–English articles with a combined dictionary of 196 749 words (43 % Russian, 57 % English words).

We built a multi-language model with 400 topics. The topics cover a wide range of themes such as science, architecture, history, culture, technology, the army, and different countries. All 400 topics were reviewed by an independent assessor, who successfully interpreted all but four of them.

Table 3 shows the top 10 words for four randomly selected topics. The top words in these topics are clearly consistent between the Russian and English languages. The Russian part of the last topic contains some English words, such as “Windows” or “Server”, because it is common to use them in Russian texts without translation.

Table 3. Top 10 words with \(p(w\,{|}\,t)\) probabilities (in %) from two-language topic model, based on Russian and English Wikipedia articles with mutual interlanguage links.

6 Conclusions

BigARTM is an open source project for parallel online topic modeling of large text collections. It provides high flexibility for various applications due to multimodality and additive combinations of regularizers. The BigARTM architecture has rich potential: the current components can be reused in a distributed solution that runs on a cluster, and single-node performance can be further improved by offloading batch processing to a GPU.