Abstract
The TDT corpora, developed to support the DARPA-sponsored program in Topic Detection and Tracking, combine data collected over a nine-month period from 8 English and 3 Chinese sources. The published corpora contain audio; reference text, including written news text and transcripts of the broadcast audio; boundary tables segmenting the broadcasts into stories; and relevance tables resulting from millions of human judgments. Sections of the corpora have undergone topic-story, first-story and story-link annotation. Both the TDT-2 and TDT-3 text corpora and the accompanying broadcast audio are now available from the Linguistic Data Consortium. This paper describes the raw material collected for the corpora, the annotation of that material to prepare it for research use, and the formats in which it is distributed. Special attention is paid to the quality control measures developed for these data sets.
The Linguistic Data Consortium’s work in building the TDT-2 and TDT-3 corpora was supported in part by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation.
Copyright information
© 2002 Springer Science+Business Media New York
About this chapter
Cite this chapter
Cieri, C., Strassel, S., Graff, D., Martey, N., Rennert, K., Liberman, M. (2002). Corpora for Topic Detection and Tracking. In: Allan, J. (eds) Topic Detection and Tracking. The Information Retrieval Series, vol 12. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0933-2_3
DOI: https://doi.org/10.1007/978-1-4615-0933-2_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-5311-9
Online ISBN: 978-1-4615-0933-2
eBook Packages: Springer Book Archive