Test Data Likelihood for PLSA Models

Brants, Thorsten

doi:10.1007/s10791-005-5658-8

Published: April 2005

Volume 8, pages 181–196, (2005)
Information Retrieval

Abstract

Probabilistic Latent Semantic Analysis (PLSA) is a statistical latent class model that has recently received considerable attention. In its usual formulation it cannot assign likelihoods to unseen documents. Furthermore, it assigns a probability of zero to unseen documents during training. We point out that one of the two existing alternative formulations of the Expectation-Maximization algorithm for PLSA does not require this assumption. However, even that formulation does not allow calculation of the actual likelihood values. We therefore derive a new test-data likelihood substitute for PLSA and compare it to three existing likelihood substitutes. An empirical evaluation shows that our new likelihood substitute produces the best predictions about accuracies in two different IR tasks and is therefore best suited to determine the number of EM steps when training PLSA models. The new likelihood measure and its evaluation also suggest that PLSA is not very sensitive to overfitting for the two tasks considered. This renders additions such as tempered EM, which specifically address overfitting, unnecessary.
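
For orientation, the standard PLSA model that the abstract refers to can be written in two equivalent parameterizations. The block below is a sketch of that usual formulation, not of the new likelihood substitute derived in the article; the symbols d (document), w (word), z (latent class), and n(d, w) (count of word w in document d) are notation assumed here.

```latex
% Symmetric parameterization of the joint document-word probability
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)

% Asymmetric parameterization (equivalent by Bayes' rule)
P(d, w) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z)

% Training log-likelihood maximized by EM
\mathcal{L} = \sum_{d} \sum_{w} n(d, w)\, \log P(d, w)
```

In both forms the document-specific parameters, P(d | z) or P(z | d), are estimated only for documents in the training set. An unseen document has no such parameters, which is the reason the usual formulation cannot directly assign it a likelihood.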

Author information

Authors and Affiliations

Google, Inc., 1600 Amphitheatre Parkway, Bldg. 42, Mountain View, CA, 94043, USA
Thorsten Brants

Corresponding author

Correspondence to Thorsten Brants.

Additional information

The work reported here was carried out while the author was at the Palo Alto Research Center (PARC).

About this article

Cite this article

Brants, T. Test Data Likelihood for PLSA Models. Inf Retrieval 8, 181–196 (2005). https://doi.org/10.1007/s10791-005-5658-8

Issue Date: April 2005
DOI: https://doi.org/10.1007/s10791-005-5658-8
