Abstract
Chinese characters have a complex, hierarchical graphical structure that carries both semantic and phonetic information. We use this structure to enrich the text model and obtain better results in standard NLP operations. First, to handle graphical variation, we define allographic classes of characters. Next, the inclusion relation between a subcharacter and a character provides us with a directed graph of allographic classes. We attach two weights to this graph, semanticity (the semantic relation between subcharacter and character) and phoneticity (the phonetic relation), and calculate the “most semantic subcharacter paths” for each character. Finally, we claim that adding the information contained in these paths to unigrams increases the efficiency of text mining methods. We evaluate our method on a text classification task on two corpora (Chinese and Japanese) totaling 18 million characters and obtain an improvement of 3% over an already high baseline of 89.6% precision, obtained with a linear SVM classifier. Other possible applications and perspectives of the system are discussed.
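The pipeline outlined in the abstract can be illustrated with a small sketch. The abstract does not define how a "most semantic subcharacter path" is computed, so the code below assumes one plausible reading: starting from a character, repeatedly step to the included subcharacter with the highest semanticity weight. The toy graph `INCLUSION`, its scores, the function names, and the `sub:` feature prefix are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy inclusion graph: character -> [(subcharacter, semanticity)].
# Scores in [0, 1] are invented for illustration only.
INCLUSION: Dict[str, List[Tuple[str, float]]] = {
    "媽": [("女", 0.9), ("馬", 0.1)],  # 女 is the semantic part, 馬 the phonetic part
    "清": [("氵", 0.8), ("青", 0.2)],  # 氵 is the semantic part, 青 the phonetic part
    "女": [],
    "馬": [],
    "氵": [],
    "青": [],
}


def most_semantic_path(char: str, graph: Dict[str, List[Tuple[str, float]]]) -> List[str]:
    """Greedy path (an assumption, not the paper's definition): at each node,
    follow the outgoing edge with the highest semanticity until a leaf is reached."""
    path: List[str] = []
    seen = {char}  # guards against cycles in malformed data
    current = char
    while True:
        candidates = [(s, w) for s, w in graph.get(current, []) if s not in seen]
        if not candidates:
            return path
        best, _ = max(candidates, key=lambda sw: sw[1])
        path.append(best)
        seen.add(best)
        current = best


def augmented_unigrams(text: str, graph: Dict[str, List[Tuple[str, float]]]) -> List[str]:
    """Emit each character unigram followed by its path nodes as extra features."""
    features: List[str] = []
    for ch in text:
        features.append(ch)
        features.extend("sub:" + s for s in most_semantic_path(ch, graph))
    return features


if __name__ == "__main__":
    print(augmented_unigrams("媽清", INCLUSION))
```

The resulting feature list could then be fed to any bag-of-features classifier, such as the linear SVM mentioned in the abstract. Other path definitions, for example maximizing the product of semanticity weights along the path, would fit the same interface.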
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Haralambous, Y. (2013). New Perspectives in Sinographic Language Processing through the Use of Character Structure. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_17
DOI: https://doi.org/10.1007/978-3-642-37247-6_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science, Computer Science (R0)