Abstract
This paper applies Chinese subword representations, namely character and syllable n-grams, to TextTiling-based automatic story segmentation of Chinese broadcast news. We show that, in lexical matching on errorful Chinese speech recognition transcripts, Chinese subwords are robust to speech recognition errors, out-of-vocabulary (OOV) words, and variability in word segmentation. We propose a multi-scale TextTiling approach that integrates the specificity of words with the robustness of subwords in the lexical similarity measure used for story boundary identification. Experiments on the TDT2 Mandarin corpus show that subword bigrams achieve the best performance among all scales, with relative f-measure improvements of 8.84% (character bigram) and 7.11% (syllable bigram) over words. Multi-scale fusion of subword bigrams with words can bring further improvement. Notably, integrating the syllable bigram with the syllable sequence of words achieves an f-measure gain of 2.66% over the syllable bigram alone.
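The lexical matching described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed sentence window, the cosine measure, and the weighted-sum fusion in `fuse` are all assumptions made for illustration, and the example sentences and their word segmentation are invented. Syllable n-grams are omitted because they would require a pronunciation lexicon.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string (whitespace ignored)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_similarities(units_per_sentence, window=2):
    """Similarity across each gap between adjacent windows of sentences.

    units_per_sentence: one list of lexical units (words or n-grams) per sentence.
    Returns one score per gap; a low score suggests a story boundary.
    """
    vecs = [Counter(units) for units in units_per_sentence]
    sims = []
    for g in range(1, len(vecs)):
        left = sum(vecs[max(0, g - window):g], Counter())
        right = sum(vecs[g:g + window], Counter())
        sims.append(cosine(left, right))
    return sims


def fuse(scores_by_scale, weights):
    """Assumed fusion rule: weighted average of the per-scale similarity curves."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, col)) / total
            for col in zip(*scores_by_scale)]


# Hypothetical word-segmented transcript: two finance sentences, then two weather ones.
sentences = [
    ["股市", "今天", "上涨"],
    ["股市", "投资者", "乐观"],
    ["台风", "登陆", "沿海"],
    ["台风", "造成", "停电"],
]
word_scores = gap_similarities(sentences)
bigram_scores = gap_similarities([char_ngrams("".join(s)) for s in sentences])
fused_scores = fuse([word_scores, bigram_scores], weights=[0.5, 0.5])
boundary_gap = fused_scores.index(min(fused_scores))  # gap with the lowest similarity
```

Character bigrams are computed from the unsegmented character stream, so they do not depend on where the word segmenter places its boundaries, which is the kind of robustness the abstract attributes to subword units.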
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xie, L., Zeng, J., Feng, W. (2008). Multi-Scale TextTiling for Automatic Story Segmentation in Chinese Broadcast News. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_33
DOI: https://doi.org/10.1007/978-3-540-68636-1_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)