Abstract
Part-of-speech (PoS) tagging is an important processing step in many natural language systems (information extraction, question answering, etc.), and it has been tackled with a number of different approaches. In this paper we study the behaviour of a Hidden Markov Model (HMM) PoS tagger for Spanish trained on a minimal amount of training data. The states of the HMM are tag pairs that emit words, and the model relies on transitional and lexical probabilities. This technique was suggested by Rabiner [11], and our implementation is influenced by Brants [2]. We investigated the best HMM configuration using a small training corpus of about 50,000 words; the maximum precision obtained on previously unseen Spanish text was 95.36%.
This research has been partially funded by the Spanish Government under project PROFIT number FIT-340100-2004-14.
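To make the model description concrete, the following is a minimal sketch of a second-order HMM tagger in which each state is a tag pair and Viterbi decoding combines transitional and lexical probabilities. It is not the authors' implementation: the add-one smoothing, the toy corpus, the tag names, and every identifier (`train`, `tag`, `START`) are assumptions made only for illustration, and the abstract does not say which smoothing or unknown-word handling the paper uses.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical padding tag marking the sentence start


def train(tagged_sentences):
    """Count tag trigrams, their two-tag contexts, and (tag, word) emissions."""
    tri = defaultdict(int)        # (t1, t2, t3) -> count
    ctx = defaultdict(int)        # (t1, t2) -> count as a trigram context
    emit = defaultdict(int)       # (tag, word) -> count
    tag_count = defaultdict(int)  # tag -> count
    vocab = set()
    for sent in tagged_sentences:
        t1, t2 = START, START
        for word, t3 in sent:
            tri[(t1, t2, t3)] += 1
            ctx[(t1, t2)] += 1
            emit[(t3, word)] += 1
            tag_count[t3] += 1
            vocab.add(word)
            t1, t2 = t2, t3
    return tri, ctx, emit, tag_count, vocab


def tag(words, model):
    """Viterbi decoding over tag-pair states (assumed add-one smoothing)."""
    if not words:
        return []
    tri, ctx, emit, tag_count, vocab = model
    tags = sorted(tag_count)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot reserves mass for unknown words

    def log_trans(t1, t2, t3):
        # Transitional probability P(t3 | t1, t2)
        return math.log((tri[(t1, t2, t3)] + 1) / (ctx[(t1, t2)] + n_tags))

    def log_emit(t, w):
        # Lexical probability P(w | t)
        return math.log((emit[(t, w)] + 1) / (tag_count[t] + n_words))

    best = {(START, START): 0.0}  # state -> best log-probability so far
    back = []                     # one back-pointer table per word
    for w in words:
        new_best, pointers = {}, {}
        for (t1, t2), score in best.items():
            for t3 in tags:
                s = score + log_trans(t1, t2, t3) + log_emit(t3, w)
                if (t2, t3) not in new_best or s > new_best[(t2, t3)]:
                    new_best[(t2, t3)] = s
                    pointers[(t2, t3)] = (t1, t2)
        best = new_best
        back.append(pointers)

    # Follow back-pointers from the best final state.
    state = max(best, key=best.get)
    result = []
    for pointers in reversed(back):
        result.append(state[1])
        state = pointers[state]
    return result[::-1]


# Toy training data, invented for this example only.
corpus = [
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ")],
    [("el", "DET"), ("gato", "NOUN"), ("duerme", "VERB")],
]
model = train(corpus)
print(tag(["el", "gato", "come"], model))  # ['DET', 'NOUN', 'VERB']
```

Using tag pairs as states lets a first-order Viterbi recursion score tag trigrams, which is the usual way to decode a second-order model. A tagger trained on real data would likely replace add-one smoothing with something stronger, for example the linear interpolation and suffix-based handling of unknown words described by Brants [2].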
References
Atserias, J., Carmona, J., Castellón, I., Cervell, S., Civit, M., Márquez, L., Martí, M.A., Padró, L., Placer, R., Rodríguez, H., Taulé, M., Turmo, J.: Morphosyntactic Analysis and Parsing of Unrestricted Spanish Text. In: First International Conference on Language Resources and Evaluation, LREC 1998, pp. 1267–1272 (1998)
Brants, T.: TnT - a statistical part-of-speech tagger. In: Proceedings of the 6th Conference on Applied Natural Language Processing, ANLP, pp. 224–231 (2000)
Brill, E.: Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics 21(4), 543–565 (1995)
Brill, E.: A corpus-based Approach to Language Learning (1993)
Carreras, X., Chao, I., Padró, L., Padró, M.: Freeling: An open-source suite of language analyzers. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC 2004, pp. 1364–1371 (2004)
Civit, M.: Criterios de etiquetación y desambiguación morfosintáctica de corpus en español. PhD thesis, Linguistics Department, Universitat de Barcelona (2003)
Daelemans, W., Zavrel, J., Berck, P., Gillis, S.: A memory-based part-of-speech tagger generator. In: Proceedings of the 4th Workshop on Very Large Corpora, pp. 14–27 (1996)
Figuerola, G., Zazo, F., Rodríguez, E., Alonso, J.: La Recuperación de Información en español y la normalización de términos. Revista Iberoamericana de Inteligencia Artificial VIII(22), 135–145 (2004)
Mérialdo, B.: Tagging English text with a probabilistic model. Computational Linguistics 20(2), 155–171 (1994)
Padró, M., Padró, L.: Developing Competitive HMM PoS Taggers Using Small Training Corpora. In: España for Natural Language Processing, EsTAL, pp. 127–136 (2004)
Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
Ratnaparkhi, A.: A maximum entropy part-of-speech tagger. In: Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 16–19 (1996)
Schmid, H.: TreeTagger - a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart (1995)
Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 260–269 (1967)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferrández, S., Peral, J. (2005). Investigating the Best Configuration of HMM Spanish PoS Tagger when Minimum Amount of Training Data Is Available. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_32
DOI: https://doi.org/10.1007/11428817_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26031-8
Online ISBN: 978-3-540-32110-1
eBook Packages: Computer Science, Computer Science (R0)