Abstract
Recent advances in natural language processing (NLP) have shown that semantically meaningful representations of words can be efficiently acquired by distributed models. In such a case, a text document can be viewed as a bag-of-word-embeddings (BoWE), and the remaining question is how to obtain a fixed-length vector representation of the document for efficient document processing. Beyond heuristic aggregation methods, recent work has shown that one can leverage the Fisher kernel (FK) framework to generate document representations based on BoWE in a principled way. In that work, words are embedded into a Euclidean space by latent semantic indexing (LSI), and a Gaussian Mixture Model (GMM) is employed as the generative model for nonlinear FK-based aggregation. In this work, we propose an alternate FK-based aggregation method for document representation based on neural word embeddings. Neural embedding models have been shown to produce significantly better word representations than LSI, and semantic relations between neural word embeddings are typically measured by cosine similarity rather than Euclidean distance. Therefore, we introduce a mixture of von Mises-Fisher distributions (moVMF) as the generative model of neural word embeddings, and derive a new FK-based aggregation method for document representation based on BoWE. We report document classification, clustering and retrieval experiments and demonstrate that our model achieves state-of-the-art performance compared with existing baseline methods.
1 Introduction
Representing text documents as fixed-length vectors is central to many language processing tasks. Perhaps the most popular fixed-length vector representation for documents is the bag-of-words (BoW) representation [1], where each word is viewed as a distinct feature dimension under a strong independence assumption. Most traditional methods either directly use the BoW representation (e.g., the tf-idf vector), or are built upon BoW (e.g., matrix factorization [2, 3] and probabilistic topical models [4, 5]). By using BoW as the foundation, the rich semantic relatedness between words is lost, and the document representation is obtained purely from word-by-document co-occurrence information.
Recent developments in distributed word representations [6, 7] have succeeded in revealing rich linguistic regularities between words. Specifically, by mapping each word into a continuous vector space, both syntactic and semantic relatedness between words can be captured using simple algebra over word vectors. Therefore, a natural idea is to build document representations on a better foundation, namely the Bag-of-Word-Embeddings (BoWE) representation, by replacing distinct words with word vectors that are learned a priori and encode rich semantic relatedness. The follow-up question is how to obtain a fixed-length vector representation of a document based on BoWE for efficient document processing.
There have been several heuristic ways to obtain the document vector based on word embeddings, e.g., by using the average or weighted sum of all the word vectors contained in a document [8]. Another well-known approach is the Paragraph Vector (PV) [9] method, which jointly learns the word and document vectors through a prediction task. A common problem of all these methods is that they assume that the document vector lies in the same semantic space as the word vectors. However, this may not be a necessary condition in practice, since documents usually convey much richer semantics than individual words.
Recent work [10] has shown that one can use the Fisher kernel (FK) framework [11] as a flexible and principled way to generate document representations based on BoWE. It consists in non-linearly mapping the word embeddings into a higher-dimensional space and in aggregating them into a document representation. Specifically, in the FK-based aggregation, words are embedded into a Euclidean space by latent semantic indexing (LSI), and a Gaussian Mixture Model (GMM) is employed as the generative model of the word embeddings. The gradients of the GMM parameters are then used to generate the document representation. This FK-based aggregation method is highly efficient (i.e., simple adding operation to generate a new document representation), and has shown its superiority in several document clustering and retrieval tasks.
However, recent advances have shown that neural word embedding models (e.g., word2vec [6]) can produce significantly better word representations than LSI. Such neural word embeddings can be efficiently acquired from large text corpora. Therefore, a natural question is whether we could leverage neural word embeddings for better document representation under the FK framework. Unfortunately, directly using the existing FK-based aggregation method [10] over neural word embeddings may not be appropriate. The major reason is that the generative model (i.e., GMM) in [10] is employed to capture the Euclidean distances between word embeddings from LSI, while semantic relations between neural word embeddings (e.g., GloVe and word2vec) are typically measured by cosine similarity. Therefore, we propose an alternate FK-based aggregation method for document representation based on neural word embeddings. The von Mises-Fisher (vMF) distribution is well suited to modeling directional data distributed on the unit hypersphere and to capturing the directional relations (i.e., cosine similarity) between vectors. We therefore introduce a mixture of von Mises-Fisher distributions (moVMF) [12] as the generative model of neural word embeddings, and derive a new aggregation algorithm based on the moVMF model under the FK framework. We evaluated the effectiveness of our model by comparing it with existing document representation methods. The empirical results demonstrate that our model achieves new state-of-the-art performance on several document classification, clustering and retrieval tasks.
2 Related Work
We provide a short review of the works on those topics which are most related to our work: Bag-of-Words, Bag-of-Word-Embeddings, vMF and Fisher Vector.
- Bag-of-Words. The most common fixed-length representation is Bag-of-Words (BoW) [1]. For example, in the popular TF-IDF scheme, each document is represented by the tf-idf values of a set of selected feature words. Besides, several dimensionality reduction methods have been proposed based on BoW, including matrix factorization methods such as LSI [2] and NMF [3], and probabilistic topical models such as PLSA [4] and LDA [5]. LDA, the generative counterpart of PLSA, has played a major role in the development of probabilistic models for textual data. As a result, it has been extended or refined in countless studies [13, 14]. However, several studies reported that LDA does not generally outperform LSI in IR or sentiment analysis tasks [15, 16]. To further tackle the prediction task, Supervised LDA [17] was developed by jointly modeling the documents and the labels.
- Bag-of-Word-Embeddings. Recent advances in the natural language processing (NLP) community have shown that the semantics of words, or more formally the distances between words, can be effectively revealed by distributed word representations. Specifically, neural embedding models, e.g., Word2Vec [6] and GloVe [7], learn word vectors efficiently from very large text corpora. Word embeddings are useful because they encode both syntactic and semantic information of words into continuous vectors, so that similar words are close in the vector space. With rich semantics encoded in word vectors, many methods [8, 9, 18,19,20] have been built upon Bag-of-Word-Embeddings (BoWE) for document representations.
- vMF in topic models. The vMF distribution has been used to model directional data by placing points on a unit sphere and is known in the literature on directional statistics [21]. [12] proposed an admixture model (moVMF) that uses vMF to model the document corpus based on normalized word frequency vectors. [22] used vMF as the observational distribution of each word and used a Hierarchical Dirichlet Process (HDP) [23], a Bayesian nonparametric variant of Latent Dirichlet Allocation (LDA), to automatically infer the number of topics.
- Fisher Kernel. The Fisher kernel is a generic framework introduced in [11] for classification purposes to combine the strengths of the generative and discriminative worlds. The idea is to characterize a signal with a gradient vector derived from a probability density function (pdf) which models the generation process of the signal. This representation can then be used as input to a discriminative classifier. This framework has been successfully applied to computer vision [24, 25] and text analysis [10]. The gradient representation of the Fisher kernel has a major advantage over the histogram of occurrences of the BoW: for the same vocabulary size, it is much larger. Hence, there is no need to use costly kernels to (implicitly) project these very high-dimensional gradient vectors into a still higher-dimensional space.
3 Model
In this section, we describe our proposed FK framework in detail, including the generation process of words with continuous mixture models and the FK-based aggregation. The proposed procedure is as follows:
Learning phase: Given an unlabeled training set of documents:
- Learn the neural word embeddings in a low-dimensional space, e.g., by word2vec. After this operation, each word w is represented by a vector \(E_w\) of size d.
- Fit a probabilistic model, i.e., a mixture of von Mises-Fisher distributions (moVMF), on these neural word embeddings. A detailed description of moVMF is given in the Probabilistic Modeling section below.
Document representation: Given a document whose BoW representation is \(\{ w_1,\dots ,w_T \}\):
- Transform the BoW representation into the BoWE representation:
$$\begin{aligned} \{ w_1,\dots ,w_T \} \rightarrow \{ E_{w_1},\dots ,E_{w_T} \} \end{aligned}$$
- Aggregate the neural word embeddings \(E_{w_t}\) using the Fisher kernel framework. We detail the framework in the Fisher Kernel Aggregation section below.
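The BoW-to-BoWE transformation step can be sketched as a simple table lookup. This is a minimal illustration, not the authors' code: the `embeddings` dictionary, the function name `to_bowe`, the skipping of out-of-vocabulary words and the explicit length normalization (needed so that the vectors lie on the unit hypersphere assumed by the vMF model) are all assumptions made here.

```python
import numpy as np

def to_bowe(tokens, embeddings):
    """Map a token list to a (T, d) array of unit-length word vectors.

    `embeddings` is a hypothetical dict from word to 1-D numpy array.
    Out-of-vocabulary tokens are skipped (an assumption, not from the paper).
    """
    vectors = [embeddings[w] for w in tokens if w in embeddings]
    if not vectors:
        dim = len(next(iter(embeddings.values())))
        return np.zeros((0, dim))
    X = np.vstack(vectors).astype(float)
    # Project onto the unit hypersphere so dot products equal cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

# Toy 2-dimensional embeddings, for illustration only.
toy = {"fisher": np.array([3.0, 4.0]), "kernel": np.array([0.0, 2.0])}
X = to_bowe(["fisher", "kernel", "unknown"], toy)
print(X.shape)  # (2, 2)
```

The rows of `X` are then the inputs \(x_t\) to the aggregation step.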
3.1 Probabilistic Modeling
We use the mixture of von Mises-Fisher distributions (moVMF) as the generative model of neural word embeddings. Here we describe the vMF distribution and the moVMF model in detail.
The von Mises-Fisher distribution is well known in the literature on directional statistics, and is suitable for data distributed on the unit hypersphere. A d-dimensional unit random vector x (i.e., \(x \in \mathbb {R}^d\) and \(||x|| = 1\)) is said to have a d-variate von Mises-Fisher distribution if its probability density function is given by
$$\begin{aligned} f(x|\mu ,\kappa ) = c_d(\kappa )\, e^{\kappa \mu ^{\top } x}, \end{aligned}$$
where \(||\mu ||=1\), \(\kappa \ge 0\) and \(d \ge 2\). The normalizing constant \(c_d{(\kappa )}\) is given by
$$\begin{aligned} c_d(\kappa ) = \frac{\kappa ^{d/2-1}}{(2\pi )^{d/2}\, I_{d/2-1}(\kappa )}, \end{aligned}$$
where \(I_r(\cdot )\) represents the modified Bessel function of the first kind and order r. The density \(f(x|\mu ,\kappa )\) is parameterized by the mean direction \(\mu \) and the concentration parameter \(\kappa \). The concentration parameter \(\kappa \) characterizes how strongly the unit vectors drawn from the distribution are concentrated around the mean direction \(\mu \). Larger values of \(\kappa \) imply stronger concentration about the mean direction.
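As a numerical illustration, the sketch below evaluates the vMF log-density, assuming the standard form \(f(x|\mu ,\kappa ) = c_d(\kappa ) e^{\kappa \mu ^{\top } x}\) with \(c_d(\kappa ) = \kappa ^{d/2-1} / ((2\pi )^{d/2} I_{d/2-1}(\kappa ))\). The function names and the series-based evaluation of the Bessel function are illustrative choices made here, not part of the original method; the truncated series is only adequate for moderate \(\kappa \) (roughly up to a few hundred).

```python
import math
import numpy as np

def log_bessel_i(order, x, terms=500):
    """log I_order(x) for x > 0, from the power series summed in log space."""
    m = np.arange(terms)
    log_gammas = np.array([math.lgamma(k + 1.0) + math.lgamma(k + order + 1.0) for k in m])
    log_terms = (2 * m + order) * math.log(x / 2.0) - log_gammas
    peak = log_terms.max()
    return peak + math.log(np.exp(log_terms - peak).sum())

def vmf_log_norm(d, kappa):
    """log c_d(kappa), the log normalizing constant of the d-variate vMF."""
    v = d / 2.0 - 1.0
    return v * math.log(kappa) - (d / 2.0) * math.log(2.0 * math.pi) - log_bessel_i(v, kappa)

def vmf_log_pdf(x, mu, kappa):
    """log f(x | mu, kappa) for unit-norm rows of x and a unit mean direction mu."""
    x = np.atleast_2d(x)
    return vmf_log_norm(x.shape[1], kappa) + kappa * (x @ mu)

mu = np.array([0.0, 0.0, 1.0])
# The density is highest at the mean direction and lowest at its antipode.
print(vmf_log_pdf(mu, mu, 2.0)[0] > vmf_log_pdf(-mu, mu, 2.0)[0])  # True
```

For d = 3 the constant has the closed form \(c_3(\kappa ) = \kappa / (4\pi \sinh \kappa )\), which gives a convenient sanity check for the series.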
Later, [12] introduced the mixture of von Mises-Fisher distributions (moVMF), which serves as a generative admixture model for directional data. Let \(f_i(x|\theta _i)\) denote a vMF distribution with parameter \(\theta _i = (\mu _i, \kappa _i)\) for \(1 \le i \le N\). Then a mixture of these N vMF distributions has a density given by
$$\begin{aligned} f(x|\varTheta ) = \sum _{i=1}^{N} \alpha _i f_i(x|\theta _i), \end{aligned}$$
where the parameters are \(\varTheta =\{\alpha _1,\dots ,\alpha _N,\theta _1,\dots ,\theta _N\}\) and the \(\alpha _i\) are non-negative and sum to one. To sample a point from this mixture density we choose the i-th vMF randomly with probability \(\alpha _i\), and then sample a point on \(\mathbb {S}^{d-1}\) (\(\mathbb {S}^{d-1}\) denotes the (\(d-1\))-dimensional sphere embedded in \(\mathbb {R}^{d}\)) following \(f_i(x|\theta _i)\). To train the model, we can use the familiar EM algorithm to efficiently iterate between estimating the most likely conditional distribution of \(\{\alpha _1,\dots ,\alpha _N\}\) in the E-step and optimizing \(\{\theta _1,\dots ,\theta _N \}\) to maximize the likelihood in the M-step. The moVMF generalizes clustering methods parameterized by cosine distance and successfully integrates a directional measure of similarity into a probabilistic setting.
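A soft-assignment EM loop for moVMF might look like the following sketch. It is not the authors' implementation: the initialization, the fixed iteration count, the clipping of \(\kappa \) to [1e-3, 500] and the optional `init_mu` argument are assumptions made here, and the concentration update uses the closed-form approximation \(\hat{\kappa } = (\bar{r} d - \bar{r}^3)/(1 - \bar{r}^2)\) from Banerjee et al. [12] rather than an exact solution.

```python
import math
import numpy as np

def log_vmf_norm(d, kappa, terms=500):
    """log c_d(kappa); log I_{d/2-1}(kappa) comes from its power series (kappa > 0)."""
    v = d / 2.0 - 1.0
    m = np.arange(terms)
    lg = np.array([math.lgamma(k + 1.0) + math.lgamma(k + v + 1.0) for k in m])
    log_terms = (2 * m + v) * math.log(kappa / 2.0) - lg
    peak = log_terms.max()
    log_bessel = peak + math.log(np.exp(log_terms - peak).sum())
    return v * math.log(kappa) - (d / 2.0) * math.log(2.0 * math.pi) - log_bessel

def fit_movmf(X, n_components, n_iter=30, seed=0, init_mu=None):
    """Soft-assignment EM for a mixture of vMFs on the unit-norm rows of X."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    if init_mu is None:
        mu = X[rng.choice(T, n_components, replace=False)].copy()
    else:
        mu = np.array(init_mu, dtype=float)
    kappa = np.full(n_components, 10.0)
    alpha = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        log_c = np.array([log_vmf_norm(d, k) for k in kappa])
        log_p = np.log(alpha) + log_c + (X @ mu.T) * kappa
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, mean directions and concentrations.
        n_i = np.maximum(resp.sum(axis=0), 1e-12)
        alpha = n_i / T
        r = resp.T @ X                                  # (N, d) weighted resultants
        r_norm = np.maximum(np.linalg.norm(r, axis=1), 1e-12)
        mu = r / r_norm[:, None]
        r_bar = np.clip(r_norm / n_i, 1e-6, 1.0 - 1e-6)
        kappa = (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)
        kappa = np.clip(kappa, 1e-3, 500.0)             # keep the Bessel series valid
    return alpha, mu, kappa
```

In the setting described in this paper, `X` would hold the unit-normalized word embeddings of the vocabulary and `n_components` would be N.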
3.2 Fisher Kernel Aggregation
In this work, we describe a given document, \(X = \{ x_t, t = 1 \dots T \}\), as a set of d-dimensional neural word embeddings whose generation process can be modeled by the probability density function (pdf) of moVMF. Evidence suggests that this type of directional measure (i.e., cosine similarity) is often superior to Euclidean distance in high dimensions [26]. In this moVMF, each vMF distribution \(f_i\) can be viewed as a visual word and N is the vocabulary size. We denote \(\varTheta = \{ \alpha _i, \mu _i, \kappa _i, i = 1 \dots N \}\), where \(\{ \alpha _i, \mu _i, \kappa _i \}\) are respectively the mixture weight, mean direction and concentration of the i-th vMF.
In practice, the moVMF is estimated offline with a set of neural word embeddings learned a priori from a large training set of documents. The parameters \(\varTheta \) are estimated through the optimization of a Maximum Likelihood (ML) criterion using the Expectation-Maximization (EM) algorithm.
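Since the partial derivatives with respect to the mixture weights \(\alpha _i\) and the concentration parameters \(\kappa _i\) carry little additional information, we only focus on the partial derivatives with respect to the mean parameters \(\mu _i\). Writing \(\mu _{\varTheta }\) for the moVMF density with parameters \(\varTheta \), X can be described by the gradient vector:
$$\begin{aligned} G_{\varTheta }^{X} = \nabla _{\varTheta } \log \mu _{\varTheta }(X) \end{aligned}$$
Intuitively, it describes in which direction the parameters \(\varTheta \) of the model should be modified so that the model \(\mu _{\varTheta }\) better fits the data. Assuming that the word embeddings \(x_t\) in X are iid, we have:
$$\begin{aligned} G_{\varTheta }^{X} = \sum _{t=1}^{T} \nabla _{\varTheta } \log \mu _{\varTheta }(x_t) \end{aligned}$$
In the following, \(\gamma _t(i)\) denotes the occupancy probability, i.e., the probability for observation \(x_t\) to be generated by the i-th vMF. Bayes' formula gives:
$$\begin{aligned} \gamma _t(i) = \frac{\alpha _i f_i(x_t|\theta _i)}{\sum _{j=1}^{N} \alpha _j f_j(x_t|\theta _j)} \end{aligned}$$
Differentiating the log-likelihood with respect to \(\mu _i\) gives:
$$\begin{aligned} \frac{\partial \log \mu _{\varTheta }(X)}{\partial \mu _i} = \sum _{t=1}^{T} \gamma _t(i)\, \kappa _i\, x_t \end{aligned}$$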
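The intermediate steps of this derivative can be spelled out for a single embedding \(x_t\). The following is a sketch that assumes the standard vMF log-density \(\log f_i(x_t|\theta _i) = \log c_d(\kappa _i) + \kappa _i \mu _i^{\top } x_t\) and treats \(\mu _i\) as an unconstrained vector, i.e., it does not enforce the unit-norm constraint \(||\mu _i|| = 1\) during differentiation:

$$\begin{aligned} \frac{\partial }{\partial \mu _i} \log \sum _{j=1}^{N} \alpha _j f_j(x_t|\theta _j) &= \frac{\alpha _i f_i(x_t|\theta _i)}{\sum _{j=1}^{N} \alpha _j f_j(x_t|\theta _j)} \cdot \frac{\partial \log f_i(x_t|\theta _i)}{\partial \mu _i} \\ &= \gamma _t(i)\, \frac{\partial }{\partial \mu _i} \left( \log c_d(\kappa _i) + \kappa _i \mu _i^{\top } x_t \right) \\ &= \gamma _t(i)\, \kappa _i\, x_t \end{aligned}$$

Summing over \(t = 1, \dots , T\) under the iid assumption yields the per-component gradient, which is a \(\gamma \)-weighted sum of the document's word embeddings scaled by \(\kappa _i\).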
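To normalize the dynamic range of the different dimensions of the gradient vector, it is important to normalize it. As in [11], the Fisher information matrix (FIM) \(F_{\varTheta }\) of \(\mu _{\varTheta }\) is suggested for this purpose:
$$\begin{aligned} F_{\varTheta } = E_{x \sim \mu _{\varTheta }} \left[ \nabla _{\varTheta } \log \mu _{\varTheta }(x)\, \nabla _{\varTheta } \log \mu _{\varTheta }(x)^{\top } \right] \end{aligned}$$
As \(F_{\varTheta }\) is symmetric and positive definite, its inverse has a Cholesky decomposition \(F_{\varTheta }^{-1} = L_{\varTheta }^{\top } L_{\varTheta }\). [11] proposed to measure the similarity between two samples X and Y, with gradient vectors \(G_{\varTheta }^{X}\) and \(G_{\varTheta }^{Y}\), by:
$$\begin{aligned} K(X, Y) = {G_{\varTheta }^{X}}^{\top } F_{\varTheta }^{-1}\, G_{\varTheta }^{Y} \end{aligned}$$
Then K(X, Y) can be rewritten as a dot-product between normalized vectors \(\mathcal {G}_{\varTheta }\) with:
$$\begin{aligned} \mathcal {G}_{\varTheta }^{X} = L_{\varTheta }\, G_{\varTheta }^{X} \end{aligned}$$
where \(\mathcal {G}_{\varTheta }^{X}\) is referred to as the Fisher Vector (FV) of X [27].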
Let \(f_{\mu _i}\) denote the terms on the diagonal of the FIM that correspond to \(\mu _i\). According to Eq. 8, we can get
Using the diagonal approximation of the FIM, we finally obtain the following formula for the gradient with respect to \(\mu _i\):
The FV \(\mathcal {G}_{\varTheta }^X\) is the concatenation of the \(\mathcal {G}_{i}^X, \forall i\), and is therefore N\(\times \)d dimensional, where d is the dimensionality of the continuous word embeddings and N is the number of vMFs.
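The aggregation can be sketched as follows, given an already fitted moVMF. This is an illustration under stated assumptions rather than the authors' implementation: the function name, the precomputed `log_c` array holding \(\log c_d(\kappa _i)\), and the optional `fim_diag` argument (the diagonal FIM terms, left to the caller) are hypothetical. Without `fim_diag`, the sketch returns the unnormalized gradients \(\sum _t \gamma _t(i) \kappa _i x_t\) concatenated over components.

```python
import numpy as np

def movmf_fisher_vector(X, alpha, mu, kappa, log_c, fim_diag=None):
    """Concatenate per-component mean gradients into an (N*d,) vector.

    X: (T, d) unit-norm word embeddings of one document.
    alpha: (N,) mixture weights; mu: (N, d) mean directions; kappa: (N,).
    log_c: (N,) precomputed log normalizing constants log c_d(kappa_i).
    fim_diag: optional (N, d) diagonal FIM terms (hypothetical argument).
    """
    log_p = np.log(alpha) + log_c + (X @ mu.T) * kappa   # (T, N) joint log-probs
    log_p -= log_p.max(axis=1, keepdims=True)            # stabilize the softmax
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)            # occupancy probabilities
    grad = kappa[:, None] * (gamma.T @ X)                # (N, d) gradients w.r.t. mu_i
    if fim_diag is not None:
        grad = grad / np.sqrt(fim_diag)                  # diagonal FIM normalization
    return grad.reshape(-1)

# Toy example: two components in two dimensions with equal concentration,
# so the normalizing constants cancel and can be set to zero.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
fv = movmf_fisher_vector(X, np.array([0.5, 0.5]), np.eye(2),
                         np.array([5.0, 5.0]), np.zeros(2))
print(fv.shape)  # (4,)
```

Because the document vector is a sum over word-level statistics, a new document only requires one pass over its embeddings, which matches the N\(\times \)d dimensionality stated above.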
4 Experiments
In this section, we conduct experiments to verify the effectiveness of our model over document classification, clustering and retrieval tasks.
4.1 Baselines
- Bag-of-Words. The Bag-of-Words model (BoW) [1] represents each document as a bag of words using tf-idf [28] as the weighting scheme. We select the top 5,000 words according to tf-idf scores and use the vanilla TFIDF in the gensim library.
- LSI. LSI [2] maps both documents and words to lower-dimensional representations in a so-called latent semantic space using singular value decomposition (SVD). We use the vanilla LSI in the gensim library with the number of topics set to 50.
- LDA. In LDA [5], each word within a document is modeled as a finite mixture over a set of topics. We use the vanilla LDA in the gensim library with the number of topics set to 50.
- cBow. Continuous Bag-of-Words model [6]. We use average pooling to compose a document vector from a set of word vectors.
- PV. Paragraph Vector [9] is an unsupervised model to learn distributed representations of words and documents. We implemented the PV-DBOW and PV-DM models ourselves since no original code is available.
- FV-GMM. The Fisher kernel based on a Gaussian mixture model (GMM) [10] is used for document representation from word embeddings. It treats documents as bags-of-embedded-words (BoEW) and learns a probabilistic mixture model once the words have been embedded in a Euclidean space.
We refer to our FK-based aggregation method as FV-moVMF.
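For contrast with the FK-based aggregation, the average-pooling composition used by the cBow baseline above can be sketched in a few lines. The function name and the handling of empty documents are assumptions made for illustration.

```python
import numpy as np

def average_pool(X):
    """Mean of the (T, d) word-vector matrix X; zeros for an empty document."""
    X = np.asarray(X, dtype=float)
    if X.shape[0] == 0:
        return np.zeros(X.shape[1])
    return X.mean(axis=0)

doc_vector = average_pool([[1.0, 2.0], [3.0, 4.0]])
print(doc_vector)  # [2. 3.]
```

Note that the pooled vector has the same dimensionality d as the word vectors, whereas the Fisher vector is N\(\times \)d dimensional.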
4.2 Setup
We used two datasets for classification, one for clustering and one for information retrieval. The same preprocessing steps were applied to all the datasets: words were lowercased, and non-English characters and stop words were removed. All the neural word embeddings used in the above methods were 50-dimensional and were trained by word2vec on the corresponding document collection of each task. For the FK-based aggregation methods, the number of mixture components was set to 15, since we observed negligible performance differences with larger values. In previous work, FV-GMM [10] obtained the word embeddings by LSI. For comparison, we also tried FV-GMM based on neural word embeddings.
We refer to these two types of aggregation methods as FV-GMM\(_{ LSI }\) and FV-GMM\(_{ Neu }\), respectively. Similarly, we also have two versions of FV-moVMF, namely FV-moVMF\(_{ LSI }\) and FV-moVMF\(_{ Neu }\).
4.3 Classification
We run the classification experiments on two publicly available datasets:
- Subj, the Subjectivity dataset [29], which contains 5,000 subjective instances (snippets) and 5,000 objective instances (snippets). The task is to classify a sentence as being subjective or objective;
- MR, Movie reviews [30] with one sentence per review. There are 5,331 positive sentences and 5,331 negative sentences. Classification involves detecting positive/negative reviews.
We use 10-fold cross-validation and Logistic Regression as the classifier.
Table 1 shows the evaluation results on the two datasets. The results show that learning text representations over BoWE (e.g., cBow, PV-DBOW, PV-DM) can in general achieve better performances than that over BoW (e.g., BoW, LSI and LDA) by involving richer semantics between words. For the FV models, the consistent improvements of neural embedding based methods over LSI based methods (i.e., FV-moVMF\(_{ Neu }\) and FV-GMM\(_{ Neu }\) vs FV-moVMF\(_{ LSI }\) and FV-GMM\(_{ LSI }\)) verify the effectiveness of neural embeddings in capturing word semantics. Furthermore, each version of FV-moVMFs works better than FV-GMMs (e.g., FV-moVMF\(_{ Neu }\) vs FV-GMM\(_{ Neu }\)), indicating that moVMF is a better statistical model for neural word embeddings than GMMs. Finally, FV-moVMF\(_{ Neu }\) can outperform all the baselines on the two datasets, demonstrating the effectiveness of our approach.
4.4 Clustering
We used one well-known and publicly available dataset, 20 Newsgroups, for clustering. The 20 Newsgroups collection contains about 20,000 newsgroup documents harvested from 20 different Usenet newsgroups, with about 1,000 documents from each newsgroup. We ran k-means on the document representations produced by each method and used two standard evaluation metrics to assess the quality of the clusters, namely the Adjusted Rand Index (ARI) [31] and Normalized Mutual Information (NMI) [32]. These measures compare the clusters with respect to the partition induced by the category information. For all the clustering methods, the number of clusters is set to the true number of classes of the collection.
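To make the first of these metrics concrete, the Adjusted Rand Index can be computed from pair counts as in the sketch below. This is a standalone illustration of the standard ARI definition [31]; the function name and the convention of returning 1.0 for degenerate inputs are choices made here, and it is not necessarily the implementation used in the experiments.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same items, from pair counts."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))   # contingency table cells
    rows = Counter(labels_true)                      # class sizes
    cols = Counter(labels_pred)                      # cluster sizes
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)      # chance-level agreement
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# A relabeled but otherwise identical partition scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

The index is invariant to permutations of the cluster labels, which is why it can compare k-means output directly against the newsgroup categories.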
From Table 2, we can observe performance trends across the different methods similar to those on the classification tasks. Moreover, the PV methods show better performance than FV-GMM\(_{ Neu }\). This indicates that the dot product employed by PV works better than the Euclidean distance used in FV-GMM\(_{ Neu }\). Finally, our FV-moVMF\(_{ Neu }\) outperforms all the other baseline models, showing the power of the FK framework for document representation with the appropriate generative distribution.
4.5 Document Retrieval
We use one TREC collection, Robust04, for the document retrieval task. The topics of Robust04 are collected from the TREC Robust Track 2004. It has approximately 500,000 documents and the vocabulary size is about 600,000. The retrieval experiments described in this section are implemented using the Galago Search Engine. We use the standard cosine similarity to produce the relevance scores between documents and the query based on different models. For evaluation, the top-ranked 1,000 documents are compared using the mean average precision (MAP) and precision at rank 20 (P@20). We also compare with the traditional retrieval model, namely BM25 [33], and linearly combine the normalized scores of BM25 and the other models:
where (d, Q) is the document-query pair and \(\lambda \) is the interpolation parameter. In our experiments, we select \(\lambda \) as 0.8 based on the development set.
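The two evaluation measures named above can be sketched for a single query as follows; MAP is then the mean of the average precision over all queries. The function names and the document identifiers are illustrative, and the defaults simply mirror the cutoffs stated in the text (rank 20 and the top 1,000 documents).

```python
def precision_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc in top if doc in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average of the precision values at the rank of each relevant document."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_ids[:cutoff], start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)

ranking = ["a", "b", "c", "d"]      # hypothetical ranked list for one query
relevant = {"a", "c"}               # hypothetical relevance judgments
print(precision_at_k(ranking, relevant, k=2))  # 0.5
```

For this toy ranking the average precision is (1/1 + 2/3) / 2, since the relevant documents appear at ranks 1 and 3.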
From Table 3 we can see that simple cosine similarity between documents and the query based on different representation models does not work well in the retrieval task, since many exact matching signals are lost in this way. When combined with BM25, improved performance can be obtained, as the semantic relatedness between the document and the query is captured. Moreover, our proposed FV-moVMF\(_{ Neu }\) brings the largest improvement among all the combinations, indicating that our model offers a better similarity with latent representations.
5 Conclusion
In this paper we introduced an alternate FK framework for document representations based on BoWE. Our new FK-based aggregation method builds upon neural word embeddings by employing a moVMF distribution as the generative model. The experimental results demonstrate that our model can achieve new state-of-the-art performances on several document processing tasks.
Nevertheless, there is still room to improve our model in the future. For example, we would like to learn the parameters of moVMF together with the FV framework, instead of estimating them offline. Moreover, it would be interesting to validate the effectiveness of using other word embedding techniques such as GloVe [7] and other statistical models for Bag-of-Word-Embeddings.
References
Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391 (1990)
Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR, pp. 50–57. ACM (1999)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), vol. 14, pp. 1532–1543 (2014)
Vulic, I., Moens, M.F.: Cross-lingual semantic similarity of words as the similarity of their semantic word responses. In: NAACL-HLT 2013, pp. 106–116. ACL (2013)
Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: ICML, vol. 14, pp. 1188–1196 (2014)
Clinchant, S., Perronnin, F.: Aggregating continuous word embeddings for information retrieval. In: Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pp. 100–109 (2013)
Jaakkola, T., Haussler, D.: Exploiting generative models in discriminative classifiers. In: NIPS, pp. 487–493 (1999)
Banerjee, A., Dhillon, I.S., Ghosh, J., Sra, S.: Clustering on the unit hypersphere using von mises-fisher distributions. J. Mach. Learn. Res. 6, 1345–1382 (2005)
Eisenstein, J., Ahmed, A., Xing, E.P.: Sparse additive generative models of text (2011)
Weng, J., Lim, E.P., Jiang, J., He, Q.: Twitterrank: finding topic-sensitive influential twitterers. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 261–270. ACM (2010)
Wang, Q., Xu, J., Li, H., Craswell, N.: Regularized latent semantic indexing. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 685–694. ACM (2011)
Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 142–150. Association for Computational Linguistics (2011)
Blei, D.M., McAuliffe, J.D.: Supervised topic models. In: Proceedings of Advances in Neural Information Processing Systems (2007)
Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: EMNLP, pp. 1631–1642 (2013)
Sutskever, I., Martens, J., Hinton, G.E.: Generating text with recurrent neural networks. In: ICML, vol. 11, pp. 1017–1024 (2011)
Zhao, H., Lu, Z., Poupart, P.: Self-adaptive hierarchical sentence model. In: IJCAI, pp. 4069–4076 (2015)
Fisher, R.: Dispersion on a sphere. In: Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 217, pp. 295–305. The Royal Society (1953)
Batmanghelich, K., Saeedi, A., Narasimhan, K., Gershman, S.: Nonparametric spherical topic modeling with word embeddings. arXiv preprint arXiv:1604.00126 (2016)
Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Sharing clusters among related groups: hierarchical dirichlet processes. In: NIPS, pp. 1385–1392 (2005)
Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: CVPR, pp. 1–8. IEEE (2007)
Bressan, M., Cifarelli, C., Perronnin, F.: An analysis of the relationship between painters based on their work. In: ICIP, pp. 113–116. IEEE (2008)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
Perronnin, F., Liu, Y., Sánchez, J., Poirier, H.: Large-scale image retrieval with compressed fisher vectors. In: CVPR, pp. 3384–3391. IEEE (2010)
Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
Pang, B., Lee, L.: A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, p. 271. Association for Computational Linguistics (2004)
Pang, B., Lee, L.: Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 115–124. Association for Computational Linguistics (2005)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
Estévez, P.A., Tesmer, M., Perez, C.A., Zurada, J.M.: Normalized mutual information feature selection. IEEE Trans. Neural Netw. 20(2), 189–201 (2009)
Robertson, S.E., Walker, S.: Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In: SIGIR, pp. 232–241. Springer, New York (1994). https://doi.org/10.1007/978-1-4471-2099-5_24
Acknowledgements
This work was funded by the 973 Program of China under Grant No. 2014CB340401, the National Natural Science Foundation of China (NSFC) under Grants No. 61232010, 61433014, 61425016, 61472401, 61203298 and 61722211, the Youth Innovation Promotion Association CAS under Grants No. 20144310 and 2016102, and the National Key R&D Program of China under Grants No. 2016QY02D0405.
© 2018 Springer International Publishing AG, part of Springer Nature
Zhang, R., Guo, J., Lan, Y., Xu, J., Cheng, X. (2018). Aggregating Neural Word Embeddings for Document Representation. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science(), vol 10772. Springer, Cham. https://doi.org/10.1007/978-3-319-76941-7_23
DOI: https://doi.org/10.1007/978-3-319-76941-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76940-0
Online ISBN: 978-3-319-76941-7