Abstract
In this paper, we propose an approach for generating rich, fine-grained textual descriptions of images. In particular, we use an LSTM-in-LSTM (long short-term memory) architecture, which consists of an inner LSTM and an outer LSTM. The inner LSTM effectively encodes the long-range implicit contextual interactions between visual cues (i.e., the spatially concurrent visual objects), while the outer LSTM generally captures the explicit multi-modal relationship between sentences and images (i.e., the correspondence of sentences and images). This architecture is capable of producing a long description by predicting one word at every time step conditioned on the previously generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via the inner LSTM). Our model outperforms state-of-the-art methods on several benchmark datasets (Flickr8k, Flickr30k, MSCOCO) when used to generate long, rich, fine-grained descriptions of given images, in terms of four different metrics (BLEU, CIDEr, ROUGE-L, and METEOR).
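To make the decoding process described above concrete, the following is a minimal NumPy sketch of one LSTM-in-LSTM decoding step. All dimensions, weight initialisations, and variable names here are illustrative assumptions, not the paper's actual configuration: the inner LSTM runs over a sequence of region (visual-cue) features and its final hidden state serves as the context vector; the outer LSTM then takes the previous word's embedding concatenated with that context and produces the hidden vector from which the next word is scored.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One standard LSTM cell step; W maps [x; h] to 4 gate pre-activations."""
    z = W @ np.concatenate([x, h])
    d = h.size
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c_new = f * c + i * g          # updated cell state
    h_new = o * np.tanh(c_new)     # updated hidden state
    return h_new, c_new

# Hypothetical sizes (not the paper's values).
rng = np.random.default_rng(0)
D_VIS, D_HID, VOCAB, N_REGIONS = 16, 32, 100, 5

W_in   = rng.normal(0, 0.1, (4 * D_HID, D_VIS + D_HID))   # inner LSTM weights
W_out  = rng.normal(0, 0.1, (4 * D_HID, 3 * D_HID))       # outer LSTM weights
W_emb  = rng.normal(0, 0.1, (VOCAB, D_HID))               # word embeddings
W_pred = rng.normal(0, 0.1, (VOCAB, D_HID))               # word-prediction layer

# Inner LSTM: encode the sequence of region features into a context vector
# (here taken to be its final hidden state).
regions = rng.normal(0, 1, (N_REGIONS, D_VIS))
h_i = c_i = np.zeros(D_HID)
for r in regions:
    h_i, c_i = lstm_step(r, h_i, c_i, W_in)
context = h_i

# Outer LSTM: one decoding step conditioned on the previously generated word
# and the visual context; a softmax over its hidden state scores the next word.
prev_word = W_emb[7]                         # embedding of the previous word
h_o = c_o = np.zeros(D_HID)
x = np.concatenate([prev_word, context])     # outer input: [word; context]
h_o, c_o = lstm_step(x, h_o, c_o, W_out)
logits = W_pred @ h_o
probs = np.exp(logits - logits.max())
probs /= probs.sum()
next_word = int(np.argmax(probs))
```

Repeating the outer step, feeding each newly predicted word back in as `prev_word`, yields the full description one word at a time, which is how the architecture can sustain long sentences while keeping the fine-grained visual context available at every step.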
Author information
Additional information
This article is published with open access at Springerlink.com
Jun Song received his B.Sc. degree from Tianjin University, China, in 2013. He is currently a Ph.D. candidate in computer science in the Digital Media Computing and Design Lab of Zhejiang University. His research interests include machine learning, cross-media information retrieval and understanding.
Siliang Tang received his B.Sc. degree from Zhejiang University, Hangzhou, China, and Ph.D. degree from the National University of Ireland, Maynooth, Co. Kildare, Ireland. He is currently a lecturer in the College of Computer Science, Zhejiang University. His current research interests include multimedia analysis, text mining, and statistical learning.
Jun Xiao received his B.Sc. and Ph.D. degrees in computer science from Zhejiang University in 2002 and 2007, respectively. Currently he is an associate professor in the College of Computer Science, Zhejiang University. His research interests include character animation and digital entertainment technology.
Fei Wu received his B.Sc. degree from Lanzhou University, China, in 1996, M.Sc. degree from the University of Macau, China, in 1999, and Ph.D. degree from Zhejiang University, Hangzhou, China, in 2002, all in computer science. He is currently a full professor in the College of Computer Science and Technology, Zhejiang University. His current research interests include multimedia retrieval, sparse representation, and machine learning.
Zhongfei (Mark) Zhang received his B.Sc. (Cum Laude) degree in electronics engineering and M.Sc. degree in information science, both from Zhejiang University, and Ph.D. degree in computer science from the University of Massachusetts at Amherst, USA. He is currently a full professor of computer science at the State University of New York (SUNY) at Binghamton, USA, where he directs the Multimedia Research Laboratory.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Song, J., Tang, S., Xiao, J. et al. LSTM-in-LSTM for generating long descriptions of images. Comp. Visual Media 2, 379–388 (2016). https://doi.org/10.1007/s41095-016-0059-z