Abstract
With the growing popularity of multimodal data on the Web, cross-modal retrieval on large-scale multimedia databases has become an important research topic. Hashing-based cross-modal retrieval methods assume that there is a latent space shared by the features of the different modalities. To model the relationship among heterogeneous data, most existing methods embed the data into a joint abstraction space by linear projections. However, these approaches are sensitive to noise in the data and cannot exploit the unlabeled data and multimodal data with missing values that arise in real-world applications. To address these challenges, we propose a novel multimodal deep-learning-based hashing (MDLH) algorithm. In particular, MDLH uses a deep neural network to encode heterogeneous features into a compact common representation and learns the hash functions based on that representation. The parameters of the whole model are fine-tuned in a supervised training stage. Experiments on two standard datasets show that the method is more effective for cross-modal retrieval than competing methods.
Highlights
With the growing prevalence of multimodal data on the Web, cross-modal retrieval over massive multimedia databases has become a research hotspot. Cross-modal retrieval methods assume that a shared latent feature space exists among the features of multiple modalities. Accordingly, to model the correlation among multimodal data, most existing methods map each modality into a common feature space via linear projections. However, such methods are sensitive to noise in the data and cannot make use of the unlabeled data or data with missing modalities found in real-world scenarios. To address this problem, this paper proposes a novel hashing algorithm based on multimodal deep learning. The method uses a deep neural network to map heterogeneous features into a common compact representation and learns hash functions on top of this representation. The parameters of the whole model are trained in a supervised manner. Experimental results on two standard datasets show that the proposed method effectively accomplishes the cross-modal retrieval task.
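The retrieval pipeline described above, in which modality-specific encoders map heterogeneous features into a shared space whose sign gives the binary hash code, and cross-modal search is done by Hamming distance, can be sketched as follows. This is a minimal illustration only: the fixed random projections, the feature dimensions, and all function names are assumptions standing in for the supervised deep networks that MDLH actually learns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: image features and text features, and a 16-bit code.
d_img, d_txt, n_bits = 128, 10, 16

# Stand-ins for the learned modality-specific encoders. In MDLH these are deep
# networks fine-tuned with supervision; here they are fixed random projections.
W_img = rng.standard_normal((d_img, n_bits))
W_txt = rng.standard_normal((d_txt, n_bits))

def hash_image(x):
    """Encode an image feature vector into an n_bits binary code (sign of tanh)."""
    return (np.tanh(x @ W_img) > 0).astype(np.uint8)

def hash_text(t):
    """Encode a text feature vector into a code in the same shared space."""
    return (np.tanh(t @ W_txt) > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.sum(a != b))

# Cross-modal query: rank database images by Hamming distance to a text query.
db_imgs = rng.standard_normal((5, d_img))
db_codes = np.array([hash_image(x) for x in db_imgs])
query_code = hash_text(rng.standard_normal(d_txt))
ranking = np.argsort([hamming(query_code, c) for c in db_codes])
```

Because both encoders emit codes of the same length in the same space, a text query can be compared directly against image codes, which is what makes the retrieval cross-modal.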
Acknowledgements
This work was supported by National Natural Science Foundation of China (Grant Nos. 61402091, 61370074), and Fundamental Research Funds for the Central Universities of China (Grant No. N140404012).
Cite this article
Qu, W., Wang, D., Feng, S. et al. A novel cross-modal hashing algorithm based on multimodal deep learning. Sci. China Inf. Sci. 60, 092104 (2017). https://doi.org/10.1007/s11432-015-0902-2