Abstract
Text retrieval from broadcast news video is unsatisfactory because a transcript word frequently does not directly ‘describe’ the shot during which it was spoken. Extending the retrieved region to a window around the matching keyword improves recall but yields low precision. We improve on text retrieval with the following approach. First, we segment the visual stream into coherent story-like units using a set of visual news story delimiters. After filtering out clearly irrelevant classes of shots, we are still left with ambiguity about how words in the transcript relate to the visual content of the remaining shots in the story. We use a limited set of visual features at different semantic levels, ranging from color histograms to faces, cars, and outdoor scenes, to build an association matrix that captures the correlation of these visual features with specific transcript words. This matrix is then refined using an expectation-maximization (EM) approach. Preliminary results show that this approach has the potential to significantly improve retrieval performance for text queries.
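To make the association step concrete, the following is a minimal sketch of how a co-occurrence matrix between visual features and transcript words might be refined with EM. It is not the authors' implementation: the abstract does not specify the form of the EM update, so the sketch assumes a simple update in the style of IBM Model 1 from statistical machine translation, with each story treated as a bag of transcript words paired with a bag of discrete visual tokens. The function names (`init_association`, `normalise`, `em_refine`) and the toy tokens (`face`, `indoor`, `clinton`, `speech`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def normalise(counts):
    """Turn raw counts[v][w] into probabilities p(w | v), one row per visual token."""
    table = {}
    for v, row in counts.items():
        total = sum(row.values())
        if total > 0:
            table[v] = {w: c / total for w, c in row.items()}
    return table


def init_association(stories):
    """Initial association matrix: story-level co-occurrence counts, row-normalised."""
    counts = defaultdict(lambda: defaultdict(float))
    for words, visuals in stories:
        for v in visuals:
            for w in words:
                counts[v][w] += 1.0
    return normalise(counts)


def em_refine(stories, table, iterations=10):
    """Refine p(w | v) with an IBM Model 1 style EM loop (an assumed form)."""
    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        for words, visuals in stories:
            for w in words:
                # E-step: spread one unit of credit for word w over the
                # visual tokens of the same story, in proportion to p(w | v).
                z = sum(table.get(v, {}).get(w, 0.0) for v in visuals)
                if z == 0.0:
                    continue
                for v in visuals:
                    counts[v][w] += table.get(v, {}).get(w, 0.0) / z
        # M-step: re-estimate p(w | v) from the expected counts.
        table = normalise(counts)
    return table


# Toy data (hypothetical): each story is (transcript words, visual tokens).
toy_stories = [
    (["clinton", "speech"], ["face", "indoor"]),
    (["clinton"], ["face"]),
]
refined = em_refine(toy_stories, init_association(toy_stories))
for v, row in sorted(refined.items()):
    print(v, {w: round(p, 3) for w, p in sorted(row.items())})
```

In the toy data, the second story pairs `clinton` with `face` alone, so each EM iteration shifts more of the credit for `clinton` in the first story toward `face`, leaving `speech` to be explained mainly by `indoor`. This illustrates how refinement can sharpen an ambiguous co-occurrence matrix; real visual features would first have to be quantised into discrete tokens, a step this sketch does not cover.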
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Duygulu, P., Hauptmann, A. (2004). What’s News, What’s Not? Associating News Videos with Words. In: Enser, P., Kompatsiaris, Y., O’Connor, N.E., Smeaton, A.F., Smeulders, A.W.M. (eds) Image and Video Retrieval. CIVR 2004. Lecture Notes in Computer Science, vol 3115. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27814-6_19
DOI: https://doi.org/10.1007/978-3-540-27814-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22539-3
Online ISBN: 978-3-540-27814-6
eBook Packages: Springer Book Archive