Abstract
The Thai writing system has no explicit marks to delimit words or sentences. This has motivated much machine learning research, including work on automatic indexing in information retrieval, where keywords must be identified for searching. A new method for constructing lexicons from a text corpus is presented. The method is based on basic Thai morphology and on Bayesian network concepts. The Bayesian networks are based on the well-known minimum description length (MDL) principle. The MDL concepts allow us to construct Thai lexicons, which are then used to segment Thai text. The segmentation effectiveness in terms of recall/precision is 59/51, compared with 71/54 for the dictionary-based procedure. However, the new algorithm does not require any lexicon patterns for training.
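As a rough illustration of the MDL idea mentioned in the abstract, the sketch below scores alternative segmentations of a toy corpus by a generic two-part code length: the cost of spelling out the lexicon plus the cost of encoding the corpus as a sequence of lexicon entries. This is not the paper's Bayesian-network formulation, which the abstract does not spell out. The function name `description_length`, the fixed 8 bits per character, the unigram coding model, and the Latin-alphabet toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def description_length(segmented_corpus, bits_per_char=8.0):
    """Two-part code length in bits (illustrative assumption, not the paper's formula).

    segmented_corpus: list of sentences, each a list of word strings.
    """
    tokens = [word for sentence in segmented_corpus for word in sentence]
    counts = Counter(tokens)
    total = len(tokens)

    # Part 1: lexicon cost. Each distinct word is spelled out character by
    # character, plus one terminator character per entry.
    lexicon_bits = sum((len(word) + 1) * bits_per_char for word in counts)

    # Part 2: corpus cost. Each token is encoded with its Shannon code
    # length, -log2 p(word), under a unigram model estimated from the corpus.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())

    return lexicon_bits + corpus_bits


# Toy corpus written without spaces, standing in for unsegmented text.
raw = ["thecatsat", "thecatran", "thedogsat", "thedogran"]

# Three candidate segmentations of the same text.
unsegmented = [[s] for s in raw]                       # one token per sentence
characters = [list(s) for s in raw]                    # one token per character
words = [[s[0:3], s[3:6], s[6:9]] for s in raw]        # three-letter words

dl_unsegmented = description_length(unsegmented)
dl_characters = description_length(characters)
dl_words = description_length(words)

for name, dl in [("unsegmented", dl_unsegmented),
                 ("characters", dl_characters),
                 ("words", dl_words)]:
    print(f"{name:12s} {dl:8.2f} bits")
```

Under these assumptions the word-level segmentation gets the shortest description, because its lexicon is small and its entries are reused often. A lexicon-acquisition procedure in this spirit would search over candidate segmentations for one that lowers the total cost.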
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jaruskulchai, C. (1998). An automatic Thai lexical acquisition from text. In: Lee, HY., Motoda, H. (eds) PRICAI’98: Topics in Artificial Intelligence. PRICAI 1998. Lecture Notes in Computer Science, vol 1531. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0095290
DOI: https://doi.org/10.1007/BFb0095290
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65271-7
Online ISBN: 978-3-540-49461-4
eBook Packages: Springer Book Archive