Abstract
In this paper, we propose an automatic word-spacing method for a Korean text preprocessing system that resolves the problem of context-dependent word-spacing. The method combines a stochastic method with partial parsing. First, the stochastic method splits an input sentence into a candidate-word sequence using word unigrams and syllable bigrams. Second, the system engages a partial parsing module based on the asymmetric relation between the candidate words. The partial parsing module manages the governing relationship using words that are registered in the knowledge base as having a high probability of spacing errors. These words serve as parsing trigger points based on their linguistic information, and they determine the parsing direction as well as the parsing scope. By combining the stochastic and linguistic methods, the automatic word-spacing system becomes robust against the problem of context-dependent word-spacing. An average improvement of 8.98% in the total error rate is obtained for inner and external data.
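The first, stochastic step can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only word unigrams (the syllable-bigram component and the partial parsing module are omitted), and the dictionary `WORD_UNIGRAMS`, its probabilities, the penalty `UNKNOWN_LOGP`, and the function name `segment` are all hypothetical values chosen for illustration. The sketch picks the split of an unspaced string that maximizes the product of word unigram probabilities, using dynamic programming.

```python
import math

# Hypothetical toy unigram probabilities (illustrative only, not from the paper).
WORD_UNIGRAMS = {
    "아버지": 0.02,
    "아버지가": 0.03,
    "가": 0.01,
    "가방": 0.02,
    "방": 0.01,
    "방에": 0.03,
    "에": 0.01,
    "들어가신다": 0.02,
}

# Assumed log-probability penalty for a single syllable not in the dictionary.
UNKNOWN_LOGP = math.log(1e-8)


def segment(text, unigrams=WORD_UNIGRAMS):
    """Split unspaced text into the candidate-word sequence with the
    highest total unigram log-probability (Viterbi-style search)."""
    n = len(text)
    max_len = max(len(w) for w in unigrams)  # longest dictionary entry
    # best[i] = (best score for text[:i], start index of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in unigrams:
                logp = math.log(unigrams[word])
            elif end - start == 1:
                logp = UNKNOWN_LOGP  # fall back to one unknown syllable
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]


print(" ".join(segment("아버지가방에들어가신다")))
```

With these toy probabilities the unspaced string 아버지가방에들어가신다 is split as 아버지가 방에 들어가신다. A purely statistical choice like this can still be wrong when the correct spacing depends on context, which is the case the abstract's second step, partial parsing, is meant to handle.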
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kang, My., Yoon, A., Kwon, Hc. (2004). Combined Word-Spacing Method for Disambiguating Korean Texts. In: Webb, G.I., Yu, X. (eds) AI 2004: Advances in Artificial Intelligence. AI 2004. Lecture Notes in Computer Science, vol 3339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30549-1_49
DOI: https://doi.org/10.1007/978-3-540-30549-1_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24059-4
Online ISBN: 978-3-540-30549-1
eBook Packages: Computer Science (R0)