1 Introduction

Representing text documents as fixed-length vectors is central to many language processing tasks. Perhaps the most popular fixed-length representation for documents is the bag-of-words (BoW) representation [1], where each word is treated as a distinct feature dimension under a strong independence assumption. Most traditional methods either use the BoW representation directly (e.g., tf-idf vectors), or are built upon BoW (e.g., matrix factorization [2, 3] and probabilistic topic models [4, 5]). Clearly, by using BoW as the foundation, the rich semantic relatedness between words is lost: the document representation is obtained purely from word-by-document co-occurrence information.

Recent developments in distributed word representations [6, 7] have succeeded in revealing rich linguistic regularities between words. Specifically, by mapping each word into a continuous vector space, both syntactic and semantic relatedness between words can be captured using simple algebra over word vectors. A natural idea, therefore, is to build document representations on a better foundation, namely the Bag-of-Word-Embeddings (BoWE) representation, by replacing distinct words with word vectors learned a priori, which already encode rich semantic relatedness. The follow-up question is how to obtain a fixed-length vector representation of a document based on BoWE for efficient document processing.

There have been several heuristic ways to obtain a document vector from word embeddings, e.g., taking the average or a weighted sum of all the word vectors contained in a document [8]. Another well-known approach is the Paragraph Vector (PV) method [9], which jointly learns word and document vectors through a prediction task. A common problem of all these methods is that they assume the document vector lies in the same semantic space as the word vectors. However, this may not be a necessary condition in practice, since documents usually convey much richer semantics than individual words.

Recent work [10] has shown that the Fisher kernel (FK) framework [11] can serve as a flexible and principled way to generate document representations based on BoWE. It consists of non-linearly mapping the word embeddings into a higher-dimensional space and aggregating them into a document representation. Specifically, in the FK-based aggregation, words are embedded into a Euclidean space by latent semantic indexing (LSI), and a Gaussian Mixture Model (GMM) is employed as the generative model of the word embeddings. The gradients with respect to the GMM parameters are then used to generate the document representation. This FK-based aggregation is highly efficient (i.e., a simple additive operation generates a new document representation), and has shown its superiority in several document clustering and retrieval tasks.

However, recent advances have shown that neural word embedding models (e.g., word2vec [6]) can produce significantly better word representations than LSI. Such neural word embeddings can be acquired efficiently from large text corpora. A natural question, therefore, is whether we can leverage neural word embeddings for better document representation under the FK framework. Unfortunately, directly applying the existing FK-based aggregation method [10] to neural word embeddings may not be appropriate. The major reason is that the generative model (i.e., GMM) in [10] captures Euclidean distances between the LSI word embeddings, whereas semantic relations between neural word embeddings (e.g., Glove and word2vec) are typically measured by cosine similarity. We therefore propose an alternative FK-based aggregation method for document representation based on neural word embeddings. As is well known, the von Mises-Fisher (vMF) distribution is well-suited to modeling directional data distributed on the unit hypersphere and captures directional relations (i.e., cosine similarity) between vectors. We thus introduce a mixture of von Mises-Fisher distributions (moVMF) [12] as the generative model of neural word embeddings, and derive a new aggregation algorithm based on the moVMF model under the FK framework. We evaluated the effectiveness of our model by comparing it with existing document representation methods. The empirical results demonstrate that our model achieves new state-of-the-art performance on several document classification, clustering and retrieval tasks.

2 Related Work

We provide a short review of the topics most closely related to our work: Bag-of-Words, Bag-of-Word-Embeddings, vMF in topic models, and the Fisher kernel.

  • Bag-of-Words. The most common fixed-length representation is Bag-of-Words (BoW) [1]. For example, in the popular TF-IDF scheme, each document is represented by the tf-idf values of a set of selected feature words. Several dimensionality reduction methods have been proposed on top of BoW, including matrix factorization methods such as LSI [2] and NMF [3], and probabilistic topic models such as PLSA [4] and LDA [5]. LDA, the generative counterpart of PLSA, has played a major role in the development of probabilistic models for textual data, and has been extended or refined in countless studies [13, 14]. However, several studies have reported that LDA does not generally outperform LSI in IR or sentiment analysis tasks [15, 16]. To tackle prediction tasks, Supervised LDA [17] jointly models documents and their labels.

  • Bag-of-Word-Embeddings. Recent advances in natural language processing (NLP) have shown that the semantics of words, or more formally the distances between words, can be effectively revealed by distributed word representations. Specifically, neural embedding models, e.g., Word2Vec [6] and Glove [7], learn word vectors efficiently from very large text corpora. Word embeddings are useful because they encode both syntactic and semantic information of words into continuous vectors, so that similar words are close in the vector space. With rich semantics encoded in word vectors, many methods [8, 9, 18, 19, 20] have been built upon the Bag-of-Word-Embeddings (BoWE) representation for document representation.

  • vMF in topic models. The vMF distribution has been used to model directional data by placing points on a unit sphere and is well known in the directional statistics literature [21]. [12] proposed an admixture model (moVMF) that uses vMF distributions to model a document corpus based on normalized word frequency vectors. [22] used vMF as the observational distribution of each word and employed a Hierarchical Dirichlet Process (HDP) [23], a Bayesian nonparametric variant of Latent Dirichlet Allocation (LDA), to automatically infer the number of topics.

  • Fisher Kernel. The Fisher kernel is a generic framework introduced in [11] for classification purposes, combining the strengths of the generative and discriminative worlds. The idea is to characterize a signal with a gradient vector derived from a probability density function (pdf) that models the generation process of the signal. This representation can then be used as input to a discriminative classifier. The framework has been successfully applied to computer vision [24, 25] and text analysis [10]. The gradient representation of the Fisher kernel has a major advantage over the BoW histogram of occurrences: for the same vocabulary size, it is much higher-dimensional. Hence, there is no need for costly kernels to (implicitly) project these very high-dimensional gradient vectors into a still higher-dimensional space.

3 Model

In this section, we describe our proposed FK framework in detail, including the generation process of word embeddings with continuous mixture models and the FK-based aggregation. The proposed procedure is as follows (a minimal code sketch of the whole procedure is given after the list):

Learning phase: Given an unlabeled training set of documents:

  • Learn neural word embeddings in a low-dimensional space, e.g., with word2vec. After this step, each word w is represented by a vector \(E_w\) of size d.

  • Fit a probabilistic model, i.e., a mixture of von Mises-Fisher distributions (moVMF), to these neural word embeddings. The moVMF model is described in detail in the Probabilistic Modeling section below.

Document representation: Given a document whose BoW representation is \(\{ w_1,\dots ,w_T \}\):

  • Transform the BoW representation into the BoWE representation:

    $$\begin{aligned} \{ w_1,\dots ,w_T \} \rightarrow \{ E_{w_1},\dots ,E_{w_T} \} \end{aligned}$$
  • Aggregate the neural word embeddings \(E_{w_t}\) using the Fisher kernel framework. We detail the framework in the Fisher Kernel Aggregation section below.
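
The following minimal sketch illustrates the procedure above under simplifying assumptions: it uses gensim's word2vec (gensim ≥ 4) for the embedding step on a toy tokenized corpus, and `bowe` is a hypothetical helper name for the BoW-to-BoWE lookup. The moVMF fitting and FK aggregation steps are sketched in the following subsections.

```python
# A minimal sketch of the learning phase (step 1) and the BoW -> BoWE
# transformation, assuming gensim >= 4 and a toy tokenized corpus.
import numpy as np
from gensim.models import Word2Vec

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["dogs", "chase", "cats", "in", "the", "garden"]]

d = 50  # embedding dimension
w2v = Word2Vec(sentences=docs, vector_size=d, min_count=1, sg=1, epochs=50)

def bowe(doc):
    """Map a tokenized document {w_1, ..., w_T} to {E_{w_1}, ..., E_{w_T}}."""
    return np.stack([w2v.wv[w] for w in doc if w in w2v.wv])

X = bowe(docs[0])  # shape (T, d); input to the FK aggregation step below
```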

3.1 Probabilistic Modeling

We use a mixture of von Mises-Fisher distributions (moVMF) as the generative model of neural word embeddings. Here we describe the vMF distribution and the moVMF model in detail.

The von Mises-Fisher distribution is well known in the directional statistics literature and is suitable for data distributed on the unit hypersphere. A d-dimensional unit random vector x (i.e., \(x \in \mathbb {R}^d\) and \(||x|| = 1\)) is said to follow the d-variate von Mises-Fisher distribution if its probability density function is given by,

$$\begin{aligned} f(x|\mu ,\kappa ){=} c_d{(\kappa )} e^{\kappa {\mu }^\mathrm {T} x}, \end{aligned}$$
(1)

where \(||\mu ||=1\), \(\kappa \ge 0\) and \(d \ge 2\). The normalizing constant \(c_d{(\kappa )}\) is given by,

$$\begin{aligned} c_d{(\kappa )} {=} \frac{{\kappa }^{d/2-1}}{(2\pi )^{d/2} I_{d/2-1}{(\kappa )}}, \end{aligned}$$
(2)

where \(I_r(\cdot )\) represents the modified Bessel function of the first kind and order r. The density \(f(x|\mu ,\kappa )\) is parameterized by the mean direction \(\mu \), and the concentration parameter \(\kappa \). The concentration parameter \(\kappa \) characterizes how strongly the unit vectors drawn from the distribution are concentrated on the mean direction \(\mu \). Larger values of \(\kappa \) imply stronger concentration about the mean direction.
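
As a small numerical illustration of Eqs. 1 and 2, the sketch below evaluates the vMF log-density with scipy; the exponentially scaled Bessel function `ive` keeps the computation stable for large \(\kappa \). The function name `vmf_log_pdf` is our own.

```python
# Log-density of the d-variate vMF distribution (Eqs. 1-2), using
# log I_r(kappa) = log ive(r, kappa) + kappa for numerical stability.
import numpy as np
from scipy.special import ive

def vmf_log_pdf(x, mu, kappa):
    """log f(x | mu, kappa) for unit vectors x and mu."""
    d = mu.shape[0]
    r = d / 2.0 - 1.0
    log_c = r * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) \
            - (np.log(ive(r, kappa)) + kappa)
    return log_c + kappa * np.dot(mu, x)

mu = np.array([1.0, 0.0, 0.0])
print(vmf_log_pdf(np.array([1.0, 0.0, 0.0]), mu, kappa=10.0))  # aligned with mu
print(vmf_log_pdf(np.array([0.0, 1.0, 0.0]), mu, kappa=10.0))  # orthogonal to mu
```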

[12] later introduced the mixture of von Mises-Fisher distributions (moVMF), which serves as a generative admixture model for directional data. Let \(f_i(x|\theta _i)\) denote a vMF distribution with parameters \(\theta _i = (\mu _i, \kappa _i)\) for \(1 \le i \le N\). A mixture of these N vMF distributions then has the density

$$\begin{aligned} f(x|\varTheta ) = \sum _{i=1}^{N}\alpha _i f_i(x|\theta _i), \end{aligned}$$
(3)

where \(\varTheta =\{\alpha _1,\dots ,\alpha _N,\theta _1,\dots ,\theta _N\}\) and the mixture weights \(\alpha _i\) are non-negative and sum to one. To sample a point from this mixture density, we choose the i-th vMF component with probability \(\alpha _i\), and then sample a point on \(\mathbb {S}^{d-1}\) (the (\(d-1\))-dimensional unit sphere embedded in \(\mathbb {R}^{d}\)) from \(f_i(x|\theta _i)\). The model can be trained with the familiar EM algorithm, iterating between computing the posterior probabilities of the mixture components for each observation (E-step) and updating \(\{\alpha _1,\dots ,\alpha _N\}\) and \(\{\theta _1,\dots ,\theta _N \}\) to maximize the expected log-likelihood (M-step). The moVMF generalizes clustering methods parameterized by cosine distance and successfully integrates a directional measure of similarity into a probabilistic setting.
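
As a concrete illustration of this training procedure, the sketch below implements EM for a moVMF in numpy. It is our own illustrative code (not the authors' implementation), and it estimates \(\kappa _i\) with the standard approximation from [12], \(\hat{\kappa } \approx (\bar{r} d - \bar{r}^3)/(1-\bar{r}^2)\).

```python
# EM for a mixture of N vMF distributions on row-normalized vectors X (T x d).
# This is an illustrative sketch, not the authors' implementation.
import numpy as np
from scipy.special import ive, logsumexp

def vmf_log_pdf_rows(X, mu, kappa):
    """Vectorized vMF log-density (Eq. 1) over the rows of X."""
    d = X.shape[1]
    r = d / 2.0 - 1.0
    log_c = r * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) \
            - (np.log(ive(r, kappa)) + kappa)
    return log_c + kappa * (X @ mu)

def fit_movmf(X, N, n_iter=50, seed=0):
    T, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(T, size=N, replace=False)]    # init means on data points
    kappa = np.full(N, 10.0)
    alpha = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        # E-step: posterior probability gamma_t(i) of component i for x_t.
        log_p = np.stack([np.log(alpha[i]) + vmf_log_pdf_rows(X, mu[i], kappa[i])
                          for i in range(N)], axis=1)           # (T, N)
        gamma = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))
        # M-step: update mixture weights, mean directions and concentrations.
        Nk = gamma.sum(axis=0)                                  # soft counts
        alpha = Nk / T
        R = gamma.T @ X                                         # (N, d)
        r_norm = np.linalg.norm(R, axis=1)
        mu = R / r_norm[:, None]
        r_bar = np.clip(r_norm / Nk, 1e-6, 1.0 - 1e-6)
        kappa = (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)   # approximation of [12]
    return alpha, mu, kappa
```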

3.2 Fisher Kernel Aggregation

In this work, we describe a given document, \(X = \{ x_t, t = 1 \dots T \}\), as a set of d-dimensional neural word embeddings whose generation process is modeled by the probability density function (pdf) of a moVMF. Evidence suggests that this type of directional measure (i.e., cosine similarity) is often superior to Euclidean distance in high dimensions [26]. In this moVMF, each vMF component \(f_i\) can be viewed as a word of a probabilistic vocabulary (analogous to a visual word in image Fisher vectors), and N is the vocabulary size. We denote the parameters by \(\varTheta = \{ \alpha _i, \mu _i, \kappa _i, i = 1 \dots N \}\), where \(\alpha _i\), \(\mu _i\) and \(\kappa _i\) are respectively the mixture weight, mean direction and concentration parameter of the i-th vMF component.

In practice, the moVMF is estimated offline on a set of neural word embeddings learned a priori from a large training corpus. The parameters \(\varTheta \) are estimated through the optimization of a Maximum Likelihood (ML) criterion using the Expectation-Maximization (EM) algorithm.

Since the partial derivatives with respect to the mixture weights \(\alpha _i\) and concentration parameters \(\kappa _i\) carry little additional information, we focus only on the partial derivatives with respect to the mean parameters \(\mu _{\varTheta } = \{\mu _1,\dots ,\mu _N\}\). The document X can then be described by the gradient vector:

$$\begin{aligned} G_{\varTheta }^X = {\nabla }_{\mu _{\varTheta }} \log f(X|\varTheta ). \end{aligned}$$
(4)

Intuitively, it describes in which direction the parameters \(\varTheta \) of the model should be modified so that the model better fits the data. Assuming that the word embeddings \(x_t\) in X are iid, we have:

$$\begin{aligned} G_{\varTheta }^X = \sum _{t=1}^{T}{\nabla }_{\mu _{\varTheta }} \log f(x_t|\varTheta ). \end{aligned}$$
(5)

In the following, \(\gamma _t(i)\) denotes the occupancy probability, i.e., the probability that observation \(x_t\) is generated by the i-th vMF component. Bayes' formula gives:

$$\begin{aligned} \gamma _t(i)=p(i|x_t,\varTheta )=\frac{\alpha _i f_i(x_t|\theta _i)}{\sum _{j=1}^{N} \alpha _j f_j(x_t|\theta _j)}. \end{aligned}$$
(6)

Differentiating with respect to \(\mu _i\) yields:

$$\begin{aligned} G_{\mu _{i}}^X=\sum _{t=1}^{T} \gamma _t(i) \kappa _i x_t. \end{aligned}$$
(7)

To normalize the dynamic range of the different dimensions of the gradient vectors, [11] suggests using the Fisher information matrix (FIM) \(F_{\varTheta }\) for this purpose:

$$\begin{aligned} F_{\varTheta } = E_{x \sim f(x|\varTheta )}[{\nabla }_{\varTheta } \log f(x|\varTheta ) \, {\nabla }_{\varTheta } \log { f(x|\varTheta )}^{\prime }]. \end{aligned}$$
(8)

As \(F_{\varTheta }\) is symmetric and positive definite, it admits a Cholesky decomposition. [11] then proposed measuring the similarity between two samples X and Y as:

$$\begin{aligned} K(X,Y) = {G_{\varTheta }^X}^{\prime } F_{\varTheta }^{-1} G_{\varTheta }^Y. \end{aligned}$$
(9)

Then K(X, Y) can be rewritten as a dot-product between normalized vectors \(\mathcal {G}_{\varTheta }\) with:

$$\begin{aligned} \mathcal {G}_{\varTheta }^{X} = F_{\varTheta }^{-1/2} G_{\varTheta }^{X}, \end{aligned}$$
(10)

where \(\mathcal {G}_{\varTheta }^{X}\) is referred to as the Fisher Vector (FV) of X [27].

Let \(f_{\mu _i}\) denote the diagonal approximation of the FIM restricted to the entries corresponding to \(\mu _i\). According to Eq. 8, we get

$$\begin{aligned} f_{\mu _i} = \int _X f(X|\varTheta ) [\sum _{t=1}^{T} \gamma _t(i) \kappa _i x_t]^{2}dX. \end{aligned}$$
(11)

Using the diagonal approximation of the FIM, we finally obtain the following formula for the gradient with respect to \(\mu _i\):

$$\begin{aligned} \mathcal {G}_{i}^{X} = f_{\mu _i}^{-1/2} G_{\mu _{i}}^X =\sum _{t=1}^{T} \frac{\gamma _t(i)\, x_t\, d}{\alpha _i \kappa _i ||\mu _i||}. \end{aligned}$$
(12)

The FV \(\mathcal {G}_{\varTheta }^X\) is the concatenation of the \(\mathcal {G}_{i}^X\) for all i, and is therefore \(N \times d\)-dimensional, where d is the dimensionality of the continuous word embeddings and N is the number of vMF components.
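
A sketch of the full aggregation (Eqs. 6, 7 and 12) is given below: given a document's row-normalized word embeddings and the moVMF parameters \((\alpha _i, \mu _i, \kappa _i)\), it returns the \(N \times d\)-dimensional Fisher vector. It assumes the `vmf_log_pdf_rows` helper from the EM sketch above; the function name `fisher_vector` is our own.

```python
# FK aggregation of a document X (T x d word embeddings) into an
# (N * d)-dimensional Fisher vector, following Eqs. (6), (7) and (12).
import numpy as np
from scipy.special import logsumexp

def fisher_vector(X, alpha, mu, kappa):
    T, d = X.shape
    N = alpha.shape[0]
    # Eq. (6): occupancy probabilities gamma_t(i).
    log_p = np.stack([np.log(alpha[i]) + vmf_log_pdf_rows(X, mu[i], kappa[i])
                      for i in range(N)], axis=1)                # (T, N)
    gamma = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))
    fv = []
    for i in range(N):
        # Eq. (7): un-normalized gradient with respect to mu_i.
        g_i = kappa[i] * (gamma[:, i][:, None] * X).sum(axis=0)  # (d,)
        # Eq. (12): scale by the diagonal FIM approximation f_{mu_i}^{-1/2}.
        g_i *= d / (alpha[i] * kappa[i] ** 2 * np.linalg.norm(mu[i]))
        fv.append(g_i)
    return np.concatenate(fv)

# Usage: fv = fisher_vector(bowe(doc), alpha, mu, kappa)
```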

4 Experiments

In this section, we conduct experiments to verify the effectiveness of our model over document classification, clustering and retrieval tasks.

4.1 Baselines

  • Bag-of-Words. The Bag-of-Words model (BoW) [1] represents each document as a bag of words with tf-idf [28] as the weighting scheme. We select the top 5,000 words according to their tf-idf scores and use the vanilla TF-IDF implementation in the gensim library (a short code sketch of the gensim-based baselines is given at the end of this subsection).

  • LSI. LSI [2] maps both documents and words to lower-dimensional representations in a so-called latent semantic space using singular value decomposition (SVD). We use the vanilla LSI implementation in the gensim library with the number of topics set to 50.

  • LDA. In LDA [5], each document is modeled as a finite mixture over a set of latent topics. We use the vanilla LDA implementation in the gensim library with the number of topics set to 50.

  • cBow. The Continuous Bag-of-Words model [6]. We use average pooling to compose a document vector from the set of word vectors.

  • PV. Paragraph Vector [9] is an unsupervised model that learns distributed representations of words and documents. We implemented the PV-DBOW and PV-DM models ourselves since the original code is not available.

  • FV-GMM. The Fisher kernel based on a Gaussian mixture model (GMM) [10] is used for document representation from word embeddings. It treats documents as bags-of-embedded-words (BoEW) and learns a probabilistic mixture model once words have been embedded in a Euclidean space.

We refer to our FK-based aggregation method as FV-moVMF.
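
For illustration, the following sketch builds the gensim-based baselines (TF-IDF, LSI, LDA) and the cBow average-pooling baseline on a toy corpus; it assumes the `docs` list and the `w2v` model from the sketch in Section 3, and the number of topics matches the settings above.

```python
# Gensim-based baselines (BoW/TF-IDF, LSI, LDA) and the cBow baseline.
# Assumes `docs` (tokenized documents) and `w2v` from the earlier sketch.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LsiModel, LdaModel

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]

tfidf = TfidfModel(bow)                                        # BoW with tf-idf weights
lsi = LsiModel(tfidf[bow], id2word=dictionary, num_topics=50)  # LSI baseline
lda = LdaModel(bow, id2word=dictionary, num_topics=50)         # LDA baseline

# cBow baseline: average pooling of the word vectors of a document.
cbow_vec = np.mean([w2v.wv[w] for w in docs[0] if w in w2v.wv], axis=0)
```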

4.2 Setup

We used two datasets for classification, one for clustering and one for information retrieval. The same preprocessing steps were applied to all datasets: words were lowercased, and non-English characters and stop words were removed. All the neural word embeddings used in the above methods were trained on the corresponding document collection of each task with 50 dimensions using word2vec. For the FK-based aggregation methods, the number of mixture components was set to 15, since we observed negligible performance differences with larger values. In previous work, FV-GMM [10] obtained the word embeddings by LSI. For comparison, we also tried FV-GMM based on neural word embeddings.

We refer to these two types of aggregation methods as FV-GMM\(_{ LSI }\) and FV-GMM\(_{ Neu }\), respectively. Similarly, we also have two versions of FV-moVMF, namely FV-moVMF\(_{ LSI }\) and FV-moVMF\(_{ Neu }\).
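
For concreteness, a minimal sketch of the preprocessing described above is shown below; the tokenizer and the stop word list are illustrative choices, not the exact ones used in our experiments.

```python
# Illustrative preprocessing: lowercase, drop non-English characters,
# remove stop words. The stop word list here is a tiny placeholder.
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())   # keep English letters only
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats are chasing 2 dogs in the garden!"))
# -> ['cats', 'chasing', 'dogs', 'garden']
```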

4.3 Classification

We run the classification experiments on two publicly available datasets:

  • Subj, the Subjectivity dataset [29], which contains 5,000 subjective instances (snippets) and 5,000 objective instances (snippets). The task is to classify a sentence as being subjective or objective;

  • MR, Movie Reviews [30] with one sentence per review. There are 5,331 positive sentences and 5,331 negative sentences. Classification involves detecting positive/negative reviews.

We use 10-fold cross-validation and Logistic Regression as the classifier.
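
A sketch of this evaluation protocol with scikit-learn is shown below; `doc_vectors` and `labels` are placeholders for the precomputed document representations (e.g., FV-moVMF vectors) and the Subj/MR class labels.

```python
# 10-fold cross-validation with logistic regression over fixed-length
# document vectors; inputs are random placeholders for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
doc_vectors = rng.standard_normal((200, 750))    # e.g., N docs x (15 * 50) FVs
labels = rng.integers(0, 2, size=200)            # subjective/objective, pos/neg

scores = cross_val_score(LogisticRegression(max_iter=1000),
                         doc_vectors, labels, cv=10)
print("10-fold CV accuracy: %.2f%%" % (100 * scores.mean()))
```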

Table 1. Classification accuracies (%) of different models. Best scores are in bold. Two-tailed t-tests demonstrate that the improvements of our model over all the baseline models are statistically significant (\(^{\ddag }\) indicates \(\text {p-value} < 0.05\)).
Table 2. Clustering results of different models (in %). Best scores are in bold. Two-tailed t-tests demonstrate that the improvements of our model over all the baseline models are statistically significant (\(^{\ddag }\) indicates \(\text {p-value} < 0.05\)).

Table 1 shows the evaluation results on the two datasets. The results show that learning text representations over BoWE (e.g., cBow, PV-DBOW, PV-DM) generally achieves better performance than over BoW (e.g., BoW, LSI and LDA) by capturing richer semantics between words. For the FV models, the consistent improvements of the neural-embedding-based methods over the LSI-based methods (i.e., FV-moVMF\(_{ Neu }\) and FV-GMM\(_{ Neu }\) vs. FV-moVMF\(_{ LSI }\) and FV-GMM\(_{ LSI }\)) verify the effectiveness of neural embeddings in capturing word semantics. Furthermore, each version of FV-moVMF works better than the corresponding FV-GMM (e.g., FV-moVMF\(_{ Neu }\) vs. FV-GMM\(_{ Neu }\)), indicating that the moVMF is a better statistical model for neural word embeddings than the GMM. Finally, FV-moVMF\(_{ Neu }\) outperforms all the baselines on both datasets, demonstrating the effectiveness of our approach.

4.4 Clustering

We used one well-known and publicly available dataset, 20 Newsgroups, for clustering. It contains about 20,000 newsgroup documents harvested from 20 different Usenet newsgroups, with about 1,000 documents per newsgroup. We applied k-means to the representations produced by all the methods and used two standard evaluation metrics to assess the quality of the clusters, namely the Adjusted Rand Index (ARI) [31] and Normalized Mutual Information (NMI) [32]. These measures compare the clusters with the partition induced by the category information. For all the clustering runs, the number of clusters is set to the true number of classes of the collection.
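
The clustering evaluation can be sketched as follows with scikit-learn; the document vectors and gold labels are placeholders.

```python
# k-means over document vectors, with the number of clusters set to the
# true number of classes (20 for 20 Newsgroups), scored by ARI and NMI.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
doc_vectors = rng.standard_normal((2000, 750))   # placeholder representations
gold_labels = rng.integers(0, 20, size=2000)     # placeholder newsgroup labels

pred = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(doc_vectors)
print("ARI:", adjusted_rand_score(gold_labels, pred))
print("NMI:", normalized_mutual_info_score(gold_labels, pred))
```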

From Table 2, we can observe performance trends similar to those on the classification tasks. Moreover, the PV methods perform better than FV-GMM\(_{ Neu }\), indicating that the dot product employed by PV works better than the Euclidean distance used in FV-GMM\(_{ Neu }\). Finally, our FV-moVMF\(_{ Neu }\) outperforms all the other baseline models, showing the power of the FK framework for document representation when paired with an appropriate generative distribution.

Table 3. Retrieval results of different models (in %). Best scores are in bold.

4.5 Document Retrieval

We use one TREC collection, Robust04, for the document retrieval task. The topics of Robust04 are taken from the TREC 2004 Robust Track. It has approximately 500,000 documents and a vocabulary size of about 600,000. The retrieval experiments described in this section are implemented with the Galago search engine. We use the standard cosine similarity to produce relevance scores between documents and the query based on the different models. For evaluation, the top-ranked 1,000 documents are compared using mean average precision (MAP) and precision at rank 20 (P@20). We also compare with a traditional retrieval model, BM25 [33], and linearly combine the normalized scores of BM25 and the other models:

$$\begin{aligned} score(d,Q) = \lambda score_{BM25}(d,Q) + (1-\lambda ) score_{model}(d,Q), \end{aligned}$$
(13)

where (d, Q) is the document-query pair and \(\lambda \) is the interpolation parameter. In our experiments, we set \(\lambda \) to 0.8 based on the development set.
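
The interpolation of Eq. 13 can be sketched as follows; min-max normalization of each score list is an assumed choice, as the normalization scheme is not specified above.

```python
# Linear interpolation of normalized BM25 and representation-based scores
# (Eq. 13) with lambda = 0.8. Score dicts map document ids to scores for
# one query and are assumed to be precomputed.
def min_max_normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def interpolate(bm25_scores, model_scores, lam=0.8):
    b = min_max_normalize(bm25_scores)
    m = min_max_normalize(model_scores)
    return {doc: lam * b.get(doc, 0.0) + (1.0 - lam) * m.get(doc, 0.0)
            for doc in set(b) | set(m)}

# Example with toy scores for three documents.
fused = interpolate({"d1": 12.3, "d2": 8.7, "d3": 5.1},
                    {"d1": 0.62, "d2": 0.80, "d3": 0.41})
print(sorted(fused.items(), key=lambda kv: -kv[1]))
```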

From Table 3 we can see that simple cosine similarity between documents and the query based on the different representation models does not work well on its own in the retrieval task, since many exact matching signals are lost in this way. When combined with BM25, improved performance is obtained, as semantic relatedness between document and query is captured. Moreover, our proposed FV-moVMF\(_{ Neu }\) brings the largest improvement among all the combinations, indicating that our model offers a better similarity measure over latent representations.

5 Conclusion

In this paper, we introduced an alternative FK framework for document representation based on BoWE. Our new FK-based aggregation method builds upon neural word embeddings by employing a moVMF distribution as the generative model. The experimental results demonstrate that our model achieves new state-of-the-art performance on several document processing tasks.

Nevertheless, there is still room to improve our model in the future. For example, we would like to learn the parameters of the moVMF jointly with the FV framework, instead of estimating them offline. Moreover, it would be interesting to validate the effectiveness of other word embedding techniques such as Glove [7] and of other statistical models for Bag-of-Word-Embeddings.