Abstract
This paper presents a novel approach to Chinese word extraction based on the semantic information of characters. A thesaurus of Chinese characters is constructed. A Chinese lexicon of 63,738 two-character words, together with the character thesaurus, is used to learn the semantic constraints between characters in Chinese word-formation, yielding a semantic-tag-based HMM. The Baum-Welch re-estimation scheme is then used to train the HMM parameters in an unsupervised manner. Various statistical measures for estimating the likelihood that a character string is a word are further tested. Large-scale experiments show promising results: the F-score of this word extraction method reaches 68.5%, whereas its counterpart, the character-based mutual information method, reaches only 47.5%.
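For orientation, the character-based mutual information baseline mentioned above can be sketched as follows. The abstract does not give the exact formulation, so this sketch assumes standard pointwise mutual information with maximum-likelihood estimates over adjacent character pairs. The function name and the input format (a list of unsegmented strings) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bigram_mutual_information(corpus):
    """Score every adjacent character pair by pointwise mutual information.

    `corpus` is an iterable of unsegmented strings (an assumed input format).
    Returns a dict mapping each two-character string to log2(P(xy) / (P(x) P(y))).
    """
    chars = Counter()
    pairs = Counter()
    for line in corpus:
        chars.update(line)
        pairs.update(line[i:i + 2] for i in range(len(line) - 1))

    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())

    scores = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs           # relative frequency of the pair
        p_x = chars[pair[0]] / n_chars   # relative frequency of the first character
        p_y = chars[pair[1]] / n_chars   # relative frequency of the second character
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores
```

A two-character string would then be proposed as a word candidate when its score exceeds some threshold; the threshold is a tuning choice that the abstract does not specify. A score of this kind uses only character co-occurrence counts, whereas the method described in the abstract additionally draws on semantic tags from the character thesaurus.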
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sun, M., Luo, S., T’sou, B.K. (2005). Word Extraction Based on Semantic Constraints in Chinese Word-Formation. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_20
DOI: https://doi.org/10.1007/978-3-540-30586-6_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)