Abstract
The Sinica Treebank is both the first Chinese treebank (released in 2000, simultaneously with the Penn Chinese Treebank) and the first treebank fully annotated with thematic role information. Its construction therefore had to address both theoretical and modeling issues in innovative ways, including the challenges posed by the lack of conventions for marking word breaks and sentence ends in Chinese text. The solution was based on maximal resource sharing: the Sinica Treebank is built upon the PoS-tagged Sinica Corpus and relies heavily on the grammatical information of the CKIP lexicon, encoded in Information-based Case Grammar (ICG). We discuss the annotation guidelines of the Sinica Treebank and its three design criteria: Maximal Resource Sharing, Minimal Structural Complexity, and Optimal Semantic Information.
Copyright information
© 2017 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Huang, CR., Chen, KJ. (2017). Sinica Treebank. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_23
DOI: https://doi.org/10.1007/978-94-024-0881-2_23
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences, Social Sciences (R0)