Multimodal Video Concept Detection via Bag of Auditory Words and Multiple Kernel Learning

Mühling, Markus; Ewerth, Ralph; Zhou, Jun; Freisleben, Bernd

doi:10.1007/978-3-642-27355-1_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7131)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

State-of-the-art systems for video concept detection mainly rely on visual features. Some previous approaches have also included audio features, either using low-level features such as mel-frequency cepstral coefficients (MFCC) or exploiting the detection of specific audio concepts. In this paper, we investigate a bag of auditory words (BoAW) approach that models MFCC features in an auditory vocabulary. The resulting BoAW features are combined with state-of-the-art visual features via multiple kernel learning (MKL). Experiments on a large set of 101 video concepts from the MediaMill Challenge show the effectiveness of using BoAW features: the system using BoAW features and a support vector machine with a χ²-kernel is superior to a state-of-the-art audio approach relying on probabilistic latent semantic indexing. Furthermore, it is shown that an early fusion approach degrades detection performance, whereas the combination of auditory and visual bag of words features via MKL yields a relative performance improvement of 9%.
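The pipeline named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an auditory vocabulary is already available (here three hand-picked 2-D codewords stand in for cluster centers over MFCC vectors), that the χ²-kernel takes the common exponential form exp(-γ · Σ (a − b)² / (a + b)), and that the auditory and visual kernels are combined with fixed weights, whereas MKL would learn those weights from training data. All function names and toy values are hypothetical.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def boaw_histogram(frames, vocabulary):
    """Assign each frame to its nearest auditory word and return
    the L1-normalized histogram of word counts."""
    counts = [0] * len(vocabulary)
    for frame in frames:
        nearest = min(range(len(vocabulary)),
                      key=lambda k: squared_distance(frame, vocabulary[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def chi2_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel (assumed form); empty bins are skipped."""
    distance = sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)
    return math.exp(-gamma * distance)


def combined_kernel(audio_pair, visual_pair, weights=(0.5, 0.5)):
    """Weighted sum of an auditory and a visual kernel value.
    The weights are fixed here for illustration; MKL would learn them."""
    k_audio = chi2_kernel(*audio_pair)
    k_visual = chi2_kernel(*visual_pair)
    return weights[0] * k_audio + weights[1] * k_visual


# Toy data: 2-D vectors stand in for MFCC frames of two video shots.
vocabulary = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
shot_a = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [3.9, 4.2]]
shot_b = [[0.2, 0.1], [0.0, 0.1], [4.1, 3.8], [4.0, 4.0]]

hist_a = boaw_histogram(shot_a, vocabulary)
hist_b = boaw_histogram(shot_b, vocabulary)
audio_similarity = chi2_kernel(hist_a, hist_b)
```

In a full system, the kernel values for all pairs of shots would form a kernel matrix passed to a support vector machine; the sketch only computes the value for a single pair.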

Author information

Authors and Affiliations

Department of Mathematics & Computer Science, University of Marburg, Hans-Meerwein-Str. 3, D-35032, Marburg, Germany
Markus Mühling, Ralph Ewerth, Jun Zhou & Bernd Freisleben

Editor information

Editors and Affiliations

Institute of Information Technology, Alpen-Adria-Universität Klagenfurt, Universitätsstr. 65-67, 9020, Klagenfurt, Austria
Klaus Schoeffmann
EURECOM, 2229 Route des Crêtes, BP 193, 06904, Sophia Antipolis Cedex, France
Bernard Merialdo
School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, 15213-3890, Pittsburgh, PA, USA
Alexander G. Hauptmann
Department of Computer Science, City University of Hong Kong, Tat Chee Ave, Kowloon, Hong Kong
Chong-Wah Ngo
Department of Electronic and Electrical Engineering, University College London, Roberts Building, Torrington Place, WC1E 7JE, London, UK
Yiannis Andreopoulos
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstrasse 9-11 188/2, 1040, Vienna, Austria
Christian Breiteneder

About this paper

Cite this paper

Mühling, M., Ewerth, R., Zhou, J., Freisleben, B. (2012). Multimodal Video Concept Detection via Bag of Auditory Words and Multiple Kernel Learning. In: Schoeffmann, K., Merialdo, B., Hauptmann, A.G., Ngo, CW., Andreopoulos, Y., Breiteneder, C. (eds) Advances in Multimedia Modeling. MMM 2012. Lecture Notes in Computer Science, vol 7131. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27355-1_7

DOI: https://doi.org/10.1007/978-3-642-27355-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27354-4
Online ISBN: 978-3-642-27355-1
eBook Packages: Computer Science, Computer Science (R0)
