Abstract
This paper develops a new computational model for learning stochastic rules, called the PAD (Probably Almost Discriminative)-learning model, based on statistical hypothesis testing theory. The model deals with the problem of designing a discrimination algorithm that tests whether or not a given test sequence of examples, each a pair (instance, label), has come from a given stochastic rule P*. Here the composite hypothesis \(\tilde P\) is unknown except that it belongs to a given class \(\mathcal{C}\).
In this model, we propose a new discrimination algorithm based on the MDL (Minimum Description Length) principle, and then derive upper bounds on the least test sample size required by the algorithm to guarantee that the two types of error probabilities are less than δ1 and δ2, respectively, provided that the distance between the two rules to be discriminated is not less than ε.
For the parametric case where \(\mathcal{C}\) is a parametric class, this paper shows that an upper bound on test sample size is given by \(O(\tfrac{1}{\varepsilon }\ln\tfrac{1}{{\delta _1 }} + \tfrac{1}{{\varepsilon ^2 }}\ln\tfrac{1}{{\delta _2 }} + \tfrac{{\tilde k}}{\varepsilon } + \tfrac{{\ell (\tilde M)}}{\varepsilon })\). Here \(\tilde k\) is the number of real-valued parameters of the composite hypothesis \(\tilde P\), and \(\ell (\tilde M)\) is the description length of the countable model for \(\tilde P\). Further, this paper shows that the MDL-based discrimination algorithm performs well in the sense of sample-complexity efficiency, compared with other kinds of information-criteria-based discrimination algorithms. This paper also shows how to transform any stochastic PAC (Probably Approximately Correct)-learning algorithm into a PAD-learning algorithm.
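For illustration only (this is not part of the paper), the order of this sample-size bound can be evaluated numerically. The leading constant `c` and the function name below are hypothetical, since the abstract gives only the O(·) form:

```python
import math

def pad_test_sample_bound(eps, delta1, delta2, k, model_len, c=1.0):
    """Order-of-magnitude estimate of the PAD test sample size bound
    O((1/eps) ln(1/delta1) + (1/eps^2) ln(1/delta2) + k/eps + l(M)/eps).
    The constant c is a hypothetical placeholder for the hidden O-constant."""
    return c * ((1 / eps) * math.log(1 / delta1)
                + (1 / eps ** 2) * math.log(1 / delta2)
                + k / eps
                + model_len / eps)

# Example: eps = 0.1, delta1 = delta2 = 0.05, 3 parameters, model length 10
n = pad_test_sample_bound(0.1, 0.05, 0.05, 3, 10)
```

Note how the 1/ε² term attached to δ2 dominates as ε shrinks, so the second type of error is the more expensive one to control.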
For the non-parametric case where \(\mathcal{C}\) is a non-parametric class but the discrimination algorithm uses a parametric class, this paper demonstrates that the sample complexity bound for the MDL-based discrimination algorithm is essentially related to Barron and Cover's index of resolvability. The sample complexity bound offers a new view of the relationship between the index of resolvability and the MDL principle from the PAD-learning viewpoint.
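To make the MDL-based discrimination idea concrete, here is a minimal hedged sketch for a one-parameter Bernoulli class. The function names and the fixed acceptance threshold are hypothetical, and the paper's algorithm applies to much more general classes; the two-part code length used here is just the standard negative log-likelihood plus the (k/2) ln n parametric penalty:

```python
import math

def code_length(labels, p):
    """Ideal code length (in nats) of a binary label sequence under Bernoulli(p)."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    ones = sum(labels)
    return -(ones * math.log(p) + (len(labels) - ones) * math.log(1 - p))

def mdl_discriminate(labels, p_star, threshold):
    """Accept 'labels came from p_star' iff its code length exceeds the best
    two-part code length within the class by less than the threshold."""
    n = len(labels)
    p_hat = sum(labels) / n                               # ML estimate in the class
    mdl = code_length(labels, p_hat) + 0.5 * math.log(n)  # + (k/2) ln n with k = 1
    return code_length(labels, p_star) - mdl < threshold
```

On a balanced sequence, the test accepts the hypothesis p* = 0.5 and rejects a distant p* such as 0.01, since the latter inflates the code length far beyond the best two-part description.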
References
Barron, A.R., & Cover T.M. (1991). Minimum complexity density estimation. IEEE Trans. on Information Theory, IT-37, 1034–1054.
Blahut, R.E. (1988). Principles and Practice of Information Theory. Addison-Wesley.
Cover, T.M., & Thomas, J.A. (1991). Elements of Information Theory. Wiley-Interscience.
DeSantis, A., Markowsky, G., & Wegman, M.N. (1988). Learning probabilistic prediction functions. Proceedings of the First Annual Workshop on Computational Learning Theory (pp. 312–328), Morgan Kaufmann.
Gutman, M. (1989). Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. on Information Theory, IT-35, 2, 401–408.
Hand, D.J. (1981). Discrimination and Classification. New York: Wiley.
Haussler, D., & Barron, A. (1992). How well does the Bayes method work in on-line predictions of {+1, −1}-values. Proceedings of the Third NEC Symposium (pp. 74–100): SIAM.
Haussler, D., & Long, P. (1990). A generalization of Sauer's lemma. Technical Report UCSC-CRL-90-15, University of California at Santa Cruz.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.
Hoeffding, W. (1965). Asymptotically optimal test for multinomial distributions. Annals of Mathematical Statistics, 36, 369–400.
Kearns, M., & Schapire, R. (1994). Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48, 3, 464–497.
Kraft, C. (1949). A device for quantizing, grouping, and coding amplitude modulated pulses. M.S. Thesis, Department of Electrical Engineering, MIT, Cambridge, MA.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11, 416–431.
Rissanen, J. (1987). Stochastic complexity. J.R. Statist. Soc. B, 49, 3, 223–239.
Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific, Series in Computer Science, 15.
Rissanen, J., & Yu, B. (1991). MDL learning. Progress in Automation and Information Systems, Springer Verlag.
Rivest, R.L. (1987). Learning decision lists. Machine Learning, 2, 229–246.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Shannon, C.E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Valiant, L.G. (1984). A theory of the learnable. Communications. of the ACM, 27, 1134–1142.
Wallace, C.S., & Boulton, D.M. (1968). An information measure for classification. Computer Journal, 11, 185–194.
Yamanishi, K. (1991). A loss bound model for on-line stochastic prediction strategies. Proceedings of the Fourth Annual Workshop on Computational Learning Theory (pp. 290–302), Morgan Kaufmann.
Yamanishi, K. (1992a). A learning criterion for stochastic rules. Machine Learning: Special Issues for COLT-90, 9, 165–203.
Yamanishi, K. (1992b). Probably almost discriminative learning. Proceedings of the Fifth ACM Workshop on Computational Learning Theory (pp. 164–171), ACM Press.
Yamanishi, K. (1993). On polynomial-time probably almost discriminative learnability. Proceedings of the Sixth ACM Conference on Computational Learning Theory (pp. 94–100), ACM Press.
Zeitouni, O., & Gutman, M. (1991). On universal hypothesis testing via large deviations. IEEE Trans. on Information Theory, IT-37, 285–290.
Ziv, J. (1988). On classification with empirically observed statistics and universal data compression. IEEE Trans. on Information Theory, IT-34, 278–286.
Ziv, J., & Lempel, A. (1978). Compression of individual sequences via variable-rate coding. IEEE Trans. on Information Theory, IT-24, 530–536.
Yamanishi, K. Probably Almost Discriminative Learning. Machine Learning 18, 23–50 (1995). https://doi.org/10.1023/A:1022870506888