Abstract
Probabilistic Latent Semantic Analysis (PLSA) is a statistical latent class model that has recently received considerable attention. In its usual formulation it cannot assign likelihoods to unseen documents. Furthermore, it assigns a probability of zero to unseen documents during training. We point out that one of the two existing alternative formulations of the Expectation-Maximization (EM) algorithm for PLSA does not require this assumption. However, even that formulation does not allow calculation of the actual likelihood values. We therefore derive a new test-data likelihood substitute for PLSA and compare it to three existing likelihood substitutes. An empirical evaluation shows that our new likelihood substitute is the best predictor of accuracy in two different IR tasks and is therefore best suited to determine the number of EM steps when training PLSA models. The new likelihood measure and its evaluation also suggest that PLSA is not very sensitive to overfitting for the two tasks considered. This renders extensions such as tempered EM, which specifically address overfitting, unnecessary.
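To make the unseen-document problem concrete, the following is a minimal sketch, not the likelihood substitute derived in this article. It assumes the common asymmetric parameterization P(w|d) = Σ_z P(w|z) P(z|d) and uses the well-known "folding-in" workaround, in which P(w|z) is frozen and only P(z|d) is re-estimated by EM for the new document. All function names, the toy counts, and the step counts are hypothetical choices made for illustration.

```python
import math
import random

def normalize(xs):
    """Scale non-negative values to sum to 1 (uniform if all are zero)."""
    s = sum(xs)
    return [x / s for x in xs] if s > 0 else [1.0 / len(xs)] * len(xs)

def em_step_doc(doc_counts, p_w_z, p_z_doc, acc=None):
    """One EM step for a single document; returns its updated P(z|d).

    If `acc` is given, expected word-topic counts are added to it so the
    caller can re-estimate P(w|z) afterwards.
    """
    n_topics = len(p_w_z)
    new_z = [0.0] * n_topics
    for w, n in enumerate(doc_counts):
        if n == 0:
            continue
        joint = [p_z_doc[z] * p_w_z[z][w] for z in range(n_topics)]
        total = sum(joint)
        if total == 0.0:
            continue
        for z in range(n_topics):
            r = n * joint[z] / total  # count times posterior P(z|d,w)
            new_z[z] += r
            if acc is not None:
                acc[z][w] += r
    return normalize(new_z)

def train_plsa(counts, n_topics, n_steps=50, seed=0):
    """Fit P(w|z) and P(z|d) by EM on per-document word-count lists."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_steps):
        acc = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            p_z_d[d] = em_step_doc(counts[d], p_w_z, p_z_d[d], acc)
        p_w_z = [normalize(row) for row in acc]
    return p_w_z, p_z_d

def fold_in(doc_counts, p_w_z, n_steps=20):
    """Estimate P(z|d) for an unseen document with P(w|z) held fixed."""
    p_z = [1.0 / len(p_w_z)] * len(p_w_z)  # start from uniform mixing weights
    for _ in range(n_steps):
        p_z = em_step_doc(doc_counts, p_w_z, p_z)
    return p_z

def log_likelihood(doc_counts, p_w_z, p_z_doc):
    """Sum over words of n(d,w) * log sum_z P(w|z) P(z|d)."""
    ll = 0.0
    for w, n in enumerate(doc_counts):
        if n:
            p = sum(p_z_doc[z] * p_w_z[z][w] for z in range(len(p_w_z)))
            ll += n * math.log(max(p, 1e-300))  # floor avoids log(0)
    return ll

# Toy data: 4 training documents over a 4-word vocabulary.
train = [[4, 3, 0, 0], [5, 2, 1, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
p_w_z, p_z_d = train_plsa(train, n_topics=2)

# The trained model has no P(z|d) for this document, so it is folded in.
unseen = [3, 2, 0, 1]
p_z_new = fold_in(unseen, p_w_z)
ll = log_likelihood(unseen, p_w_z, p_z_new)
print("folded-in log-likelihood:", ll)
```

Because folding-in fits the mixing weights on the test document itself, the resulting value is an optimistic stand-in rather than a true held-out likelihood, which is one way to see why a likelihood substitute is needed at all.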
Additional information
The work reported here was carried out while the author was at the Palo Alto Research Center (PARC).
Cite this article
Brants, T. Test Data Likelihood for PLSA Models. Inf Retrieval 8, 181–196 (2005). https://doi.org/10.1007/s10791-005-5658-8
DOI: https://doi.org/10.1007/s10791-005-5658-8