New Perspectives in Sinographic Language Processing through the Use of Character Structure

Haralambous, Yannis

doi:10.1007/978-3-642-37247-6_17


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Chinese characters have a complex, hierarchical graphical structure that carries both semantic and phonetic information. We use this structure to enhance the text model and obtain better results in standard NLP operations. First, to tackle the problem of graphical variation, we define allographic classes of characters. Next, the inclusion relation between a subcharacter and a character provides us with a directed graph of allographic classes. We equip this graph with two weights, semanticity (the semantic relation between subcharacter and character) and phoneticity (the phonetic relation), and calculate “most semantic subcharacter paths” for each character. Finally, by adding the information contained in these paths to unigrams, we claim to increase the efficiency of text mining methods. We evaluate our method on a text classification task on two corpora (Chinese and Japanese) totaling 18 million characters and obtain an improvement of 3% over an already high baseline of 89.6% precision, obtained by a linear SVM classifier. Other possible applications and perspectives of the system are discussed.
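
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the toy inclusion graph, the invented semanticity and phoneticity weights, the greedy rule for choosing a path, the 0.5 threshold, and the `sub:` feature prefix. The abstract does not state how the paper actually defines or computes "most semantic subcharacter paths", so this should not be read as the author's algorithm.

```python
from collections import Counter

# Hypothetical toy inclusion graph: character -> {subcharacter: (semanticity, phoneticity)}.
# All weights are invented for illustration; they are not taken from the paper.
INCLUSION = {
    "媽": {"女": (0.9, 0.0), "馬": (0.1, 0.9)},  # "mother": 女 semantic, 馬 phonetic
    "森": {"林": (0.9, 0.0), "木": (0.8, 0.0)},  # "forest" contains "grove" and "tree"
    "林": {"木": (0.9, 0.0)},                    # "grove" contains "tree"
    "女": {},
    "馬": {},
    "木": {},
}


def most_semantic_path(char, graph, threshold=0.5):
    """Greedily follow the subcharacter with the highest semanticity.

    Assumption: a path stops at a leaf, at a cycle, or when the best
    semanticity falls below `threshold`.
    """
    path = []
    seen = {char}
    current = char
    while True:
        children = graph.get(current, {})
        if not children:
            break
        # Pick the child whose semanticity (first weight) is largest.
        best, (semanticity, _phoneticity) = max(
            children.items(), key=lambda item: item[1][0]
        )
        if semanticity < threshold or best in seen:
            break
        path.append(best)
        seen.add(best)
        current = best
    return path


def augmented_unigrams(text, graph):
    """Count character unigrams plus the subcharacters on each semantic path."""
    counts = Counter()
    for char in text:
        counts[char] += 1
        for sub in most_semantic_path(char, graph):
            counts["sub:" + sub] += 1  # prefix keeps path features distinct
    return counts


if __name__ == "__main__":
    print(most_semantic_path("森", INCLUSION))
    print(augmented_unigrams("媽森", INCLUSION))
```

The resulting counts could then be fed to any bag-of-features classifier, such as the linear SVM mentioned in the abstract. Prefixing path features keeps a subcharacter reached through a path separate from the same glyph occurring as a character in the text.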

Author information

Authors and Affiliations

Lab-STICC UMR CNRS 6285, Institut Télécom - Télécom Bretagne, France
Yannis Haralambous

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico D.F., Mexico
Alexander Gelbukh

About this paper

Cite this paper

Haralambous, Y. (2013). New Perspectives in Sinographic Language Processing through the Use of Character Structure. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_17

DOI: https://doi.org/10.1007/978-3-642-37247-6_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science, Computer Science (R0)
