Abstract
This paper presents a novel method for recognizing Chinese unknown words in a short-text corpus, based on our observation that such words follow either a single-character model or an affix model. We collect news titles from a news site and treat them as short texts. The approach has three steps. First, all collected titles are segmented with ICTCLAS, and statistics are gathered on potential unknown words. Second, each potential unknown word is classified under either the single-character model or the affix model according to its structure, and filtering methods are applied to remove garbage strings. Finally, unknown words are extracted according to their frequencies. The approach achieves high precision and recall, especially for the single-character model. The experimental results show that it is simple yet effective.
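The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the affix inventory, the filtering rules, or the frequency threshold, so `AFFIXES`, `STOP_CHARS`, and `min_freq` below are hypothetical placeholders. The sketch also assumes the titles have already been segmented (for example by ICTCLAS) into lists of tokens, and it interprets the single-character model as a run of adjacent one-character tokens and the affix model as a multi-character token followed by a one-character suffix.

```python
from collections import Counter

# Hypothetical suffix inventory; the abstract does not list the paper's affixes.
AFFIXES = {"族", "门", "哥"}
# Hypothetical garbage filter: function characters that disqualify a candidate.
STOP_CHARS = {"的", "了", "在"}


def candidates(tokens):
    """Yield (model, string) candidates from one segmented title."""
    i, n = 0, len(tokens)
    while i < n:
        # Single-character model (assumed): a maximal run of two or more
        # adjacent one-character tokens left behind by the segmenter.
        j = i
        while j < n and len(tokens[j]) == 1:
            j += 1
        if j - i >= 2:
            run = "".join(tokens[i:j])
            if not (set(run) & STOP_CHARS):  # drop garbage strings
                yield "single", run
            i = j
            continue
        # Affix model (assumed): a multi-character token followed by a
        # one-character suffix taken from AFFIXES.
        if len(tokens[i]) > 1 and i + 1 < n and tokens[i + 1] in AFFIXES:
            yield "affix", tokens[i] + tokens[i + 1]
            i += 2
            continue
        i += 1


def extract_unknown_words(segmented_titles, min_freq=2):
    """Count candidates over all titles and keep the frequent ones."""
    counts = Counter()
    for tokens in segmented_titles:
        counts.update(candidates(tokens))
    return {key: c for key, c in counts.items() if c >= min_freq}


# Made-up segmented titles, for illustration only.
sample = [
    ["球队", "给", "力", "表现"],
    ["给", "力", "球员", "回归"],
    ["月光", "族", "增多"],
    ["月光", "族", "消费", "调查"],
    ["上班", "族", "压力"],
    ["他", "的", "回应"],
]
result = extract_unknown_words(sample)
```

With these made-up titles, `result` keeps the candidates that occur at least twice and drops both the one-off affix candidate and the run containing a stop character. A real system would need a far larger corpus and the paper's own filtering rules.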
References
http://ictclas.org/ (September 2011)
Li, H., Huang, C.-N., Gao, J., Fan, X.-z.: The Use of SVM for Chinese New Word Identification. In: Su, K.-Y., Tsujii, J., Lee, J.-H., Kwong, O.Y. (eds.) IJCNLP 2004. LNCS (LNAI), vol. 3248, pp. 723–732. Springer, Heidelberg (2005)
Zhang, H.P., Liu, Q.: Automatic Recognition of Chinese Unknown Words Based on Roles Tagging. Chinese Journal of Computers, 85–91 (January 2004)
Wu, A., Jiang, Z.X.: Statistically-Enhanced New Word Identification in a Rule-Based Chinese System. In: Proceedings of the Second Chinese Language Processing Workshop, pp. 46–51 (2000)
Isozaki, H.: Japanese named entity recognition based on a simple rule generator and decision tree learning. In: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pp. 314–321 (2001)
Chen, K.J., Ma, W.Y.: Unknown word extraction for Chinese documents. In: The 19th International Conference on Computational Linguistics, pp. 169–175 (2002)
Meng, Y., Yu, H., Nishino, F.: Chinese new word identification based on character parsing model. In: Proceedings of the First International Joint Conference on Natural Language Processing, Sanya, Hainan Island, China, pp. 489–496 (2004)
Xu, Y.S., Wang, X., Tang, B.Z.: Chinese Unknown Word Recognition using improved Conditional Random Fields. In: Eighth International Conference on Intelligent Systems Design and Applications, pp. 363–367 (2008)
Qin, H.W., Bu, F.L.: Research on a Feature of Chinese New word Identification. Computer Engineering (2004)
Lv, H.L.: Chinese New Word Identification Based on Large-scale Corpus. Dalian University of Technology (2008)
Cui, S.Q., Liu, Q., Meng, Y., Yu, H., Nishino, F.: New Word Detection Based on Large-Scale Corpus. Journal of Computer Research and Development (2006)
Ding, J.L., Ci, X., Huang, J.X.: Approach of Internet New Word Identification Based on Immune Genetic Algorithm. Computer Science, 240–245 (January 2011)
Zhang, Y., Sun, M., Zhang, Y.: Chinese New Word Detection from Query Logs. In: Cao, L., Zhong, J., Feng, Y. (eds.) ADMA 2010, Part II. LNCS, vol. 6441, pp. 233–243. Springer, Heidelberg (2010)
Zhu, Q., Cheng, X.Y., Gao, Z.J.: The Recognition Method of Unknown Chinese Words in Fragments Based On Mutual Information. Journal of Convergence Information Technology (2010)
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jiang, X., Wang, L., Cao, Y., Lu, Z. (2011). Automatic Recognition of Chinese Unknown Word for Single-Character and Affix Models. In: Wang, Y., Li, T. (eds) Knowledge Engineering and Management. Advances in Intelligent and Soft Computing, vol 123. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25661-5_56
DOI: https://doi.org/10.1007/978-3-642-25661-5_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25660-8
Online ISBN: 978-3-642-25661-5
eBook Packages: Engineering, Engineering (R0)