Abstract
Multimedia Event Detection is a multimedia retrieval task with the goal of finding videos of a particular event in an internet video archive, given example videos and descriptions. We focus here on mining features of example videos to learn the most characteristic features, which requires a combination of multiple complementary types of features. Generally, early fusion and late fusion are two popular combination strategies. The former one fuses features before performing classification and the latter one combines output of classifiers from different features. In this paper, we introduce a fusion scheme named double fusion, which combines early fusion and late fusion together to incorporate their advantages. Results are reported on TRECVID MED 2010 and 2011 data sets. For MED 2010, we get a mean minimal normalized detection cost (MNDC) of 0.49, which exceeds the state of the art performance by more than 12 percent.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
References
Smeaton, A.F., Over, P., Kraaij, W.: Evaluation campaigns and TRECVid. In: 8th ACM International Workshop on Multimedia Information Retrieval, MIR 2006 (2006)
Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videios ’in the wild’. In: IEEE Conference on Computer Vision and Pattern Recognition CVPR 2009 (2009)
Hauptmann, A., Yan, R., Lin, W., Christel, M., Wactlar, H.: Can high- level concepts fill the semantic gap in video Retrieval? A case study with broadcast news. IEEE Transaction on Multimedia 9(5), 958–966 (2007)
Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Computer Vision 42(3), 145–175 (2001)
Yang, Y., Zhuang, Y., Wu, F., Pan, Y.: Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Transactions on Multimedia (TMM 2008) 10(3), 437–446 (2008)
Liu, J., Yang, Y., Shah, M.: Learning semantic visual vocabularies using diffusion distance. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009 (2009)
Jiang, Y.G., Zeng, X.H., Chang, S.F., et al.: Columbia-UCF TRECVID 2010 multimedia event detection: combining multiple modalities, contextual concepts, and temporal matching. In: Proceeding TRECVID Workshop (2010)
Iyengar, G., Nock, H., Neti, C.: Discriminative model fusion for semantic concept detection and annotation in video. In: Proceedings of 11th Annual ACM International Conference Multimedia, MM 2003 (2003)
Snoek, C.G.M., Worringm, M., Smeulders, A.W.M.: Early versus late fusion in semantic video analysis. In: Proceedings of 13th Annual ACM International Conference Multimedia, MM 2005 (2005)
Li, H., Bao, L., Hauptmann, A., et al.: Informedia@ TRECVID 2010. In: Proceedings of TRECVID Workshop (2010)
Gehler, P., Nowozin, S.: On feature combination for multiclass object classification. In: Proceedings of International Conference Computer Vision, ICCV 2009 (2009)
Cortes, C., Mohri, M., Rostamizadeh, A.: L 2 regularization for learning kernels. In: Proceedings of Uncertainty Artitical Intelligence, UAI 2009 (2009)
Erp, M.V., Vuurpijl, L.G., Schomaker, L.: An overview and comparison of voting methods for pattern recognition. In: Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition, IWFHR-8 (2002)
Brefeld, U., Gaertner, T., Scheffer, T., Wrobel, S.: Efficient co-regularized least squares regression. In: Proceedings of the 23rd International Conference of Machine Learning, ICML 2006 (2006)
Ayache, S., Quénot, G., Gensel, J.: Classifier Fusion for SVM-Based Multimedia Semantic Indexing. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 494–504. Springer, Heidelberg (2007)
van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluation of color descriptors for object and scene recognition. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008 (2008)
Chen, M.Y., Hauptmann, A.: MoSIFT: Recognition human actions in surveillance videos. Technological report, CMU-CS-09-161, Carnegie Mellon University (2009)
Laptev, I., Lindeberg, T.: Space-time interest points. In: Proceedings of International Conference Computer Vision, ICCV 2003 (2003)
Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Computer Vision and Pattern Recognition 2006, CVPR 2006 (2006)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV 2004) 60(2), 91–100 (2004)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001)
Vedaldi, A., Fulkerson, B.: VLFeat: An Open and Portable Library of Computer Vision Algorithms (2008)
Bernhard, S., Burges, C.J.C., Smola, A.J.: Advances in kernel methods: Support Vector Learning. MIT Press, Cambridge (1999)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lan, Zz., Bao, L., Yu, SI., Liu, W., Hauptmann, A.G. (2012). Double Fusion for Multimedia Event Detection. In: Schoeffmann, K., Merialdo, B., Hauptmann, A.G., Ngo, CW., Andreopoulos, Y., Breiteneder, C. (eds) Advances in Multimedia Modeling. MMM 2012. Lecture Notes in Computer Science, vol 7131. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27355-1_18
Download citation
DOI: https://doi.org/10.1007/978-3-642-27355-1_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27354-4
Online ISBN: 978-3-642-27355-1
eBook Packages: Computer ScienceComputer Science (R0)