Abstract
Referring image segmentation is a joint multimodal task that aims to segment the object indicated by a natural-language expression from its paired image. However, the diversity of language annotations tends to cause semantic ambiguity, which makes the semantic representation produced by the language encoder imprecise. Existing methods do not correct the language encoding module, so semantic errors in the language features cannot be remedied later in the pipeline, resulting in semantic deviation. To this end, we propose a vision-aware language reasoning model. Intuitively, the segmentation result can be used to guide the reconstruction of language features, which we express as a tree-structured recursive process. Specifically, we design a language reasoning encoding module and a mask loopback optimization module to optimize the language encoding tree; the feature weights of the tree nodes are learned through backpropagation. To overcome the tendency of traditional attention modules to introduce noise when matching local words to visual regions, we use global language prior information to estimate the importance of each word and then weight the visual region features accordingly, embodied as a language-aware vision attention module. Experimental results on four benchmark datasets show that the proposed method achieves consistent performance improvements.
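The language-aware vision attention described above can be illustrated with a minimal sketch. This is an assumption-laden toy implementation, not the paper's actual module: it assumes mean pooling for the global language prior and plain dot-product scoring, and the function and variable names (`language_aware_vision_attention`, `words`, `regions`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def language_aware_vision_attention(words, regions):
    """Toy sketch of language-aware vision attention.

    words:   (L, d) word features from the language encoder
    regions: (N, d) visual region features
    Returns region features re-weighted by a globally informed language query.
    """
    # Global language prior: here simply the mean-pooled sentence feature
    # (an assumption; the paper's pooling may differ).
    g = words.mean(axis=0)                  # (d,)
    # Importance of each word under the global prior, instead of letting
    # each word attend to regions independently (which can introduce noise).
    word_importance = softmax(words @ g)    # (L,)
    # Sentence-level query: importance-weighted sum of word features.
    query = word_importance @ words         # (d,)
    # Weight each visual region by its relevance to the query.
    region_weights = softmax(regions @ query)   # (N,)
    return region_weights[:, None] * regions    # (N, d)
```

The key design point is that word importance is computed against a global sentence representation first, so a locally misleading word cannot dominate the region weighting on its own.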
Author information
Contributions
FX completed the experiments and the writing; BL provided guidance on innovation and methods and revised the paper; CZ provided guidance on methods and revised the paper; LX formalized the formulas; MP and BL supplemented and improved the experiments. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xu, F., Luo, B., Zhang, C. et al. Vision-Aware Language Reasoning for Referring Image Segmentation. Neural Process Lett 55, 11313–11331 (2023). https://doi.org/10.1007/s11063-023-11377-z