Abstract
In many language processing tasks, most sentences convey rather simple meanings. Moreover, these tasks have a limited semantic domain that can be properly covered with a simple lexicon and a restricted syntax. Nevertheless, casual users cannot be expected to comply with any kind of formal syntactic restriction, given the inherently “spontaneous” nature of human language. In this work, we propose error-correcting-based learning techniques to cope with the complex syntactic variability that natural language generally exhibits. In our approach, a complex task is modeled in terms of a basic finite-state model, F, and a stochastic error model, E. F should account for the basic (syntactic) structures underlying the task, which convey the meaning. E should account for general vocabulary variations, missing words, superfluous words, and so on. Each “natural” user sentence is thus considered a corrupted version (according to E) of some “simple” sentence of L(F). We present bootstrapping procedures that incrementally improve the “structure” of F while estimating the probabilities of the operations of E. These techniques have been applied to a practical task of moderately high syntactic variability, and we present results that show the potential of the proposed approach.
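To make the idea of treating a user sentence as a corrupted version of a sentence of L(F) more concrete, the following minimal Python sketch finds the cheapest "simple" sentence for a given input. It is an illustration under stated assumptions, not the authors' parser: the toy automaton, the word list, the cost values, and the function name `error_correcting_parse` are all hypothetical, and a plain best-first (Dijkstra) search is swapped in for whatever parsing algorithm the paper actually uses. Costs stand in for negative log-probabilities of the operations of E.

```python
import heapq

# Hypothetical toy model F: a small finite-state automaton over words.
# Each transition is (source_state, word, target_state).
TRANSITIONS = [
    (0, "show", 1),
    (1, "flights", 2),
    (2, "to", 3),
    (3, "valencia", 4),
    (3, "madrid", 4),
]
START, FINALS = 0, {4}

# Hypothetical error model E, written as non-negative costs
# (for example, negative log-probabilities of each operation).
COST_MATCH = 0.0       # input word equals the model word
COST_SUBSTITUTE = 2.0  # input word stands in for a different model word
COST_INSERT = 1.0      # superfluous input word, no model word consumed
COST_DELETE = 1.5      # model word missing from the input


def error_correcting_parse(words):
    """Return (cost, simple_sentence) for the cheapest explanation of
    `words` as a corrupted sentence of L(F), or None if there is none."""
    # Best-first search over nodes (state, number of input words consumed).
    heap = [(0.0, START, 0, ())]
    done = set()
    while heap:
        cost, state, pos, simple = heapq.heappop(heap)
        if (state, pos) in done:
            continue
        done.add((state, pos))
        if state in FINALS and pos == len(words):
            return cost, list(simple)
        if pos < len(words):
            # Insertion: skip a superfluous input word.
            heapq.heappush(heap, (cost + COST_INSERT, state, pos + 1, simple))
        for src, word, dst in TRANSITIONS:
            if src != state:
                continue
            # Deletion: the model word does not appear in the input.
            heapq.heappush(
                heap, (cost + COST_DELETE, dst, pos, simple + (word,)))
            if pos < len(words):
                # Match or substitution: consume one input word.
                step = COST_MATCH if words[pos] == word else COST_SUBSTITUTE
                heapq.heappush(
                    heap, (cost + step, dst, pos + 1, simple + (word,)))
    return None


if __name__ == "__main__":
    print(error_correcting_parse("please show me flights to madrid".split()))
    print(error_correcting_parse("flights valencia".split()))
```

With these assumed costs, the first call treats "please" and "me" as superfluous words, and the second restores the missing "show" and "to". In the approach described above, the operation probabilities would be estimated from data rather than fixed by hand.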
Cite this article
Amengual, JC., Sanchis, A., Vidal, E. et al. Language Simplification through Error-Correcting and Grammatical Inference Techniques. Machine Learning 44, 143–159 (2001). https://doi.org/10.1023/A:1010832230794
DOI: https://doi.org/10.1023/A:1010832230794