Combined Word-Spacing Method for Disambiguating Korean Texts

Kang, Mi-young; Yoon, Aesun; Kwon, Hyuk-chul

doi:10.1007/978-3-540-30549-1_49


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3339)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence


Abstract

In this paper, we propose an automatic word-spacing method for a Korean text preprocessing system that resolves the problem of context-dependent word-spacing. The method combines a stochastic method with partial parsing. First, the stochastic method splits an input sentence into a candidate-word sequence using word unigrams and syllable bigrams. Second, the system engages a partial parsing module based on the asymmetric relation between the candidate words. The partial parsing module manages the governing relationship using words that are recorded in the knowledge base as having a high probability of spacing errors. These words serve as parsing trigger points based on their linguistic information, and they determine the parsing direction as well as the parsing scope. By combining the stochastic and linguistic methods, the automatic word-spacing system becomes robust against the problem of context-dependent word-spacing. The total error rate improves by 8.98% on average for inner and external data.
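The first, stochastic step can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything below is an assumption made for illustration: the function name `segment`, the additive log-score, the unknown-word penalty, and the toy probability tables are all hypothetical. The sketch uses dynamic programming to find the split of an unspaced sentence that maximizes the sum of word-unigram log-probabilities plus a syllable-bigram log-score for placing a space between two adjacent syllables.

```python
import math

def segment(text, word_logp, boundary_logp, unk_logp=-20.0, max_len=8):
    """Split unspaced `text` into the highest-scoring candidate-word sequence.

    word_logp:     hypothetical word-unigram log-probabilities.
    boundary_logp: hypothetical log-scores for a space between two
                   adjacent syllables, keyed by the two-syllable string.
    Unknown words are penalized with unk_logp per syllable (an assumption).
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            score = best[start] + word_logp.get(word, unk_logp * len(word))
            if start > 0:
                # Syllable bigram spanning the proposed space.
                score += boundary_logp.get(text[start - 1] + text[start], 0.0)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the sentence.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy, invented statistics for a classic ambiguous string.
word_logp = {"아버지": -4.0, "아버지가": -5.0, "가방에": -6.0,
             "방에": -5.0, "들어가신다": -7.0}
boundary_logp = {"가방": -1.0, "지가": -3.0, "에들": -0.5}

result = segment("아버지가방에들어가신다", word_logp, boundary_logp)
print(" ".join(result))  # 아버지가 방에 들어가신다
```

With these invented numbers, the two readings tie on word-unigram score and the syllable-bigram term breaks the tie. Statistics of this kind can still prefer the wrong reading in context, which is the case the second, partial-parsing step is meant to handle; that step is not sketched here because the abstract does not describe its rules.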



Author information

Authors and Affiliations

Korean Language Processing Lab., School of Electrical & Computer Engineering, Pusan National University, San 30, Jangjeon-dong, 609-735, Busan, Korea
Mi-young Kang, Aesun Yoon & Hyuk-chul Kwon


Editor information

Editors and Affiliations

Faculty of Information Technology, Monash University, VIC 3800, Australia
Geoffrey I. Webb
Science, Engineering and Technology Portfolio, Royal Melbourne Institute of Technology, VIC 3001, Melbourne, Australia
Xinghuo Yu


About this paper

Cite this paper

Kang, My., Yoon, A., Kwon, Hc. (2004). Combined Word-Spacing Method for Disambiguating Korean Texts. In: Webb, G.I., Yu, X. (eds) AI 2004: Advances in Artificial Intelligence. AI 2004. Lecture Notes in Computer Science, vol 3339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30549-1_49


DOI: https://doi.org/10.1007/978-3-540-30549-1_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24059-4
Online ISBN: 978-3-540-30549-1
eBook Packages: Computer Science, Computer Science (R0)
