Abstract
In this paper, we propose an approach for generating rich, fine-grained textual descriptions of images. In particular, we use an LSTM-in-LSTM (long short-term memory) architecture, which consists of an inner LSTM and an outer LSTM. The inner LSTM effectively encodes the long-range implicit contextual interactions between visual cues (i.e., the spatially concurrent visual objects), while the outer LSTM generally captures the explicit multi-modal relationship between sentences and images (i.e., the correspondence of sentences and images). This architecture is capable of producing a long description by predicting one word at every time step conditioned on the previously generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via the inner LSTM). Our model outperforms state-of-the-art methods on several benchmark datasets (Flickr8k, Flickr30k, MSCOCO) when used to generate long, rich, fine-grained descriptions of given images, in terms of four different metrics (BLEU, CIDEr, ROUGE-L, and METEOR).
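To make the decoding process described above concrete, the following is a minimal NumPy sketch of one LSTM-in-LSTM decoding step. All dimensions, weight initialisations, and variable names here are illustrative assumptions, not the paper's actual configuration: the inner LSTM runs over a sequence of region (visual-cue) features and its final hidden state serves as the context vector; the outer LSTM then takes the previous word's embedding concatenated with that context and produces the hidden vector from which the next word is scored.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One standard LSTM cell step; W maps [x; h] to 4 gate pre-activations."""
    z = W @ np.concatenate([x, h])
    d = h.size
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c_new = f * c + i * g          # updated cell state
    h_new = o * np.tanh(c_new)     # updated hidden state
    return h_new, c_new

# Hypothetical sizes (not the paper's values).
rng = np.random.default_rng(0)
D_VIS, D_HID, VOCAB, N_REGIONS = 16, 32, 100, 5

W_in   = rng.normal(0, 0.1, (4 * D_HID, D_VIS + D_HID))   # inner LSTM weights
W_out  = rng.normal(0, 0.1, (4 * D_HID, 3 * D_HID))       # outer LSTM weights
W_emb  = rng.normal(0, 0.1, (VOCAB, D_HID))               # word embeddings
W_pred = rng.normal(0, 0.1, (VOCAB, D_HID))               # word-prediction layer

# Inner LSTM: encode the sequence of region features into a context vector
# (here taken to be its final hidden state).
regions = rng.normal(0, 1, (N_REGIONS, D_VIS))
h_i = c_i = np.zeros(D_HID)
for r in regions:
    h_i, c_i = lstm_step(r, h_i, c_i, W_in)
context = h_i

# Outer LSTM: one decoding step conditioned on the previously generated word
# and the visual context; a softmax over its hidden state scores the next word.
prev_word = W_emb[7]                         # embedding of the previous word
h_o = c_o = np.zeros(D_HID)
x = np.concatenate([prev_word, context])     # outer input: [word; context]
h_o, c_o = lstm_step(x, h_o, c_o, W_out)
logits = W_pred @ h_o
probs = np.exp(logits - logits.max())
probs /= probs.sum()
next_word = int(np.argmax(probs))
```

Repeating the outer step, feeding each newly predicted word back in as `prev_word`, yields the full description one word at a time, which is how the architecture can sustain long sentences while keeping the fine-grained visual context available at every step.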
Author information
Additional information
This article is published with open access at Springerlink.com
Jun Song received his B.Sc. degree from Tianjin University, China, in 2013. He is currently a Ph.D. candidate in computer science in the Digital Media Computing and Design Lab of Zhejiang University. His research interests include machine learning, cross-media information retrieval and understanding.
Siliang Tang received his B.Sc. degree from Zhejiang University, Hangzhou, China, and Ph.D. degree from the National University of Ireland, Maynooth, Co. Kildare, Ireland. He is currently a lecturer in the College of Computer Science, Zhejiang University. His current research interests include multimedia analysis, text mining, and statistical learning.
Jun Xiao received his B.Sc. and Ph.D. degrees in computer science from Zhejiang University in 2002 and 2007, respectively. Currently he is an associate professor in the College of Computer Science, Zhejiang University. His research interests include character animation and digital entertainment technology.
Fei Wu received his B.Sc. degree from Lanzhou University, China, in 1996, M.Sc. degree from the University of Macau, China, in 1999, and Ph.D. degree from Zhejiang University, Hangzhou, China, in 2002, all in computer science. He is currently a full professor in the College of Computer Science and Technology, Zhejiang University. His current research interests include multimedia retrieval, sparse representation, and machine learning.
Zhongfei (Mark) Zhang received his B.Sc. (Cum Laude) degree in electronics engineering and M.Sc. degree in information science, both from Zhejiang University, and Ph.D. degree in computer science from the University of Massachusetts at Amherst, USA. He is currently a full professor of computer science at the State University of New York (SUNY) at Binghamton, USA, where he directs the Multimedia Research Laboratory.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Song, J., Tang, S., Xiao, J. et al. LSTM-in-LSTM for generating long descriptions of images. Comp. Visual Media 2, 379–388 (2016). https://doi.org/10.1007/s41095-016-0059-z