A Hybrid Approach to Automatic Word-spacing in Korean

Kang, Mi-young; Choi, Sung-woo; Kwon, Hyuk-chul

doi:10.1007/978-3-540-24677-0_30


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3029)

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems


Abstract

This paper proposes a hybrid automatic word-spacing system for the Korean language that combines stochastic and knowledge-based approaches. Our system determines the optimal splitting points of an input sentence using two simple parameters: (a) relative word frequency and (b) syllable n-gram statistics, both extracted from large processed corpora containing 33,643,884 word-tokens. This method efficiently resolves the problem of possible data noise by using processed training data, and the problem of data sparseness by using syllable n-gram statistics and large corpora. The problem of processing unseen words remains, however, and it can hardly be overcome even with a huge corpus. This study therefore supplements the stochastic approach by (a) dynamically expanding the candidate words through longest-radix selection among possible morphemes and (b) treating major lexical categories and minor lexical categories differently. The combined model remedies the drawbacks of the stochastic word-spacing algorithm and shows encouraging results: it obtained 97.51% precision in word-unit correction on the external test data.
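The stochastic component described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring function, the toy frequencies, the default boundary probability, and the names `segment`, `toy_freq`, and `toy_space` are all assumptions made for illustration. The sketch picks the split of an unspaced sentence that maximises the sum of log relative word frequencies plus a syllable-bigram score for placing a space at each boundary. Unseen strings receive a flat per-syllable penalty, which stands in for (and is much cruder than) the longest-radix candidate expansion the paper proposes.

```python
import math

def segment(text, word_freq, space_prob, unseen_log_prob=-20.0):
    """Split unspaced `text` into words with a Viterbi-style search.

    word_freq:  word -> corpus count (assumed toy lexicon).
    space_prob: (left syllable, right syllable) -> assumed probability
                that a space occurs between the two syllables.
    """
    total = sum(word_freq.values())
    max_len = max(len(w) for w in word_freq)   # longest candidate word
    n = len(text)
    best = [float("-inf")] * (n + 1)           # best score ending at i
    back = [0] * (n + 1)                       # start of last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_freq:
                # (a) relative word frequency
                score = math.log(word_freq[word] / total)
            else:
                # Assumption: flat penalty per syllable for unseen strings.
                score = unseen_log_prob * len(word)
            if end < n:
                # (b) syllable-bigram evidence for a space at this boundary
                pair = (text[end - 1], text[end])
                score += math.log(space_prob.get(pair, 0.5))
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    words = []
    i = n
    while i > 0:                               # follow back-pointers
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy statistics, invented for this example only.
toy_freq = {"아버지가": 5, "방에": 5, "들어가신다": 5, "아버지": 8, "가방에": 1}
toy_space = {("가", "방"): 0.9}
result = segment("아버지가방에들어가신다", toy_freq, toy_space)
print(" ".join(result))
```

With these invented counts the search prefers "아버지가 방에 들어가신다" over "아버지 가방에 들어가신다", because the second reading depends on the low-frequency candidate "가방에". A real system would estimate both tables from the training corpora.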



Author information

Authors and Affiliations

Korean Language Processing Lab, School of Electrical & Computer Engineering, Pusan National University, San 30, Jangjeon-dong, 609-735, Busan, Korea
Mi-young Kang, Sung-woo Choi & Hyuk-chul Kwon


Editor information

Editors and Affiliations

Institute for Information Technology, National Research Council of Canada, 1200 Montreal Road, M-50, K1A 0R6, Ottawa, Ontario, Canada
Bob Orchard
Institute for Information Technology, National Research Council, Canada
Chunsheng Yang
Department of Computer Science, Texas State University-San Marcos, Nueces 247, 601 University Drive, TX 78666-4616, San Marcos, USA
Moonis Ali


About this paper

Cite this paper

Kang, My., Choi, Sw., Kwon, Hc. (2004). A Hybrid Approach to Automatic Word-spacing in Korean. In: Orchard, B., Yang, C., Ali, M. (eds) Innovations in Applied Artificial Intelligence. IEA/AIE 2004. Lecture Notes in Computer Science, vol 3029. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24677-0_30


DOI: https://doi.org/10.1007/978-3-540-24677-0_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22007-7
Online ISBN: 978-3-540-24677-0
eBook Packages: Springer Book Archive
