Abstract
This paper proposes a hybrid automatic word-spacing system for Korean that combines stochastic and knowledge-based approaches. The system determines the optimal splitting points of an input sentence using two simple parameters: (a) relative word frequency and (b) syllable n-gram statistics, both extracted from large processed corpora containing 33,643,884 word tokens. The method efficiently handles possible data noise by training on processed data, and data sparseness by using syllable n-gram statistics and large corpora. Unseen words nevertheless remain a problem that even a huge corpus can hardly overcome. This study therefore supplements the stochastic approach by (a) dynamically expanding the candidate words through longest-radix selection among possible morphemes and (b) treating major and minor lexical categories differently. The combined model remedies the drawbacks of the purely stochastic word-spacing algorithm and shows encouraging results: it achieved 97.51% precision in word-unit correction on external test data.
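The idea of choosing optimal splitting points from relative word frequency can be illustrated with a minimal sketch. It is not the authors' system: the word counts are invented toy values, the names `WORD_COUNTS`, `UNSEEN_LOG_PROB_PER_SYLLABLE`, `word_log_prob`, and `space_sentence` are hypothetical, and a flat per-syllable penalty stands in for the syllable n-gram statistics and the longest-radix expansion, which are omitted here.

```python
import math

# Toy word counts, invented for illustration only (not from the paper's corpora).
WORD_COUNTS = {
    "나는": 50, "학교에": 30, "간다": 40,
    "나": 20, "는": 60, "학교": 25, "에": 70,
}
TOTAL = sum(WORD_COUNTS.values())

# Hypothetical penalty per syllable for strings absent from the word list.
UNSEEN_LOG_PROB_PER_SYLLABLE = math.log(1e-6)


def word_log_prob(word):
    """Log relative frequency of a known word; a per-syllable penalty otherwise."""
    count = WORD_COUNTS.get(word)
    if count is None:
        return UNSEEN_LOG_PROB_PER_SYLLABLE * len(word)
    return math.log(count / TOTAL)


def space_sentence(text, max_word_len=6):
    """Return the highest-scoring segmentation of an unspaced string."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start] + word_log_prob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return " ".join(reversed(words))
```

With these toy counts, `space_sentence("나는학교에간다")` returns `"나는 학교에 간다"`, because the three-word split has a higher summed log relative frequency than any split into shorter units. A string made only of unseen syllables is returned as a single unit, which shows why a purely frequency-based model needs an additional mechanism for unseen words.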
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kang, My., Choi, Sw., Kwon, Hc. (2004). A Hybrid Approach to Automatic Word-spacing in Korean. In: Orchard, B., Yang, C., Ali, M. (eds) Innovations in Applied Artificial Intelligence. IEA/AIE 2004. Lecture Notes in Computer Science, vol 3029. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24677-0_30
DOI: https://doi.org/10.1007/978-3-540-24677-0_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22007-7
Online ISBN: 978-3-540-24677-0
eBook Packages: Springer Book Archive