
1 Introduction

Image segmentation aims to segment objects of interest out of an image, and is a fundamental step towards high-level vision tasks such as object extraction [14, 23, 25], image captioning [21, 32, 34] and visual question answering [21, 22, 35]. This paper focuses on referring expression image segmentation, in which the objects of interest are referred to by natural language expressions, as shown in Fig. 1. Beyond traditional semantic segmentation, referring expression image segmentation needs to analyze both the image and the natural language, which makes it a more challenging task.

Fig. 1.

Example of the referring expression image segmentation task. Different from traditional image segmentation, referring expression image segmentation aims at segmenting out the object referred to by a natural language query expression.

Previous works [9, 10, 18] formulate the referring expression image segmentation task as a region-wise foreground/background classification problem. They combine each image region feature with the whole query feature [9, 10] or with every word feature [18] to classify the image region. However, each word in a query expression makes a different contribution to identifying the desired object, which calls for differentiated treatment when extracting the text feature. Extracting key words helps to suppress the noise in the query and to highlight the desired objects. In addition, existing methods ignore the visual context among different image regions. Visual context is important for localizing and recognizing objects. Fig. 1 illustrates an example with two foreground objects, i.e., the bride and the groom. It is clear that the groom is on the right side of the bench, which is important for matching the query expression.

In this paper, we propose a key-word-aware network (KWAN) that extracts key words for each image region and models the key-word-aware visual context among multiple image regions in accordance with the natural language query. Firstly, we use a convolutional neural network (CNN) and a recurrent neural network (RNN) to encode the features of every image region and every word, respectively. Based on these features, we then find the key words for each image region with a query attention model. Next, a key-word-aware visual context model captures the visual context among multiple image regions in accordance with the corresponding key words. Finally, we classify each image region based on the extracted visual features, the key-word-aware visual context features and the corresponding key word features. We verify the proposed method on the ReferItGame and Google-Ref datasets. The results show that our method outperforms previous state-of-the-art methods and achieves the best IoU and precision.

This paper is organized as follows. We introduce the related work in Sect. 2. In Sect. 3, we detail our proposed method for referring expression image segmentation. Experimental results are reported in Sect. 4 to validate the effectiveness of our method. Finally, Sect. 5 concludes this paper.

2 Related Work

In summary, there are three categories of work related to the task of this paper. The first is semantic segmentation, which is one of the most classic tasks in image segmentation and a foundation for referring expression image segmentation. The second is referring expression visual localization, which also needs to search for objects in a given image from natural language expressions. The third is referring expression image segmentation itself.

Semantic Segmentation. Semantic segmentation technologies have developed quickly in recent years, with convolutional neural network (CNN)-based methods achieving state-of-the-art performance. CNN-based semantic segmentation methods can be mainly divided into two types. The first is hybrid proposal-classifier models [1, 4,5,6,7], which first generate a number of proposals from the input image, and then segment out the foreground object in each proposal. The second is fully convolutional networks (FCNs) [2, 20, 27, 36], which segment the whole image end-to-end, without any pre-processing. Some methods [3, 15, 16, 19, 28, 39] leverage visual context models to boost semantic segmentation performance, modeling the relationships among multiple image regions based on their spatial positions. Wang et al. [31] built an interaction between semantic segmentation and natural language: they extract an object relationship distribution from natural language descriptions, and then use the extracted distribution to constrain the object categories in semantic segmentation predictions. These semantic segmentation methods are foundations for the referring expression image segmentation task.

Referring Expression Visual Localization. Referring expression visual localization aims to localize regions in an image from natural language expressions. The goal of this task is to find bounding boxes [11, 24, 26, 37, 38] or attention regions [21, 22, 32, 33, 35] referred to by natural language queries. Methods in [11, 24, 26, 37, 38] first reconstructed the natural language expressions from a number of pre-extracted proposals, and then took the proposal with the highest reconstruction score as the referred object. Methods proposed in [21, 22, 32, 33, 35] used visual attention models to measure the importance of each image region for the image captioning [21, 32] or visual question answering [21, 22, 35] task, and the most important regions were deemed the attention regions. The similarity between these localization methods and referring expression image segmentation methods is that both need to find objects referred to by natural language queries. However, the localization methods only focus on generating bounding boxes or coarse attention maps, while referring expression image segmentation methods aim at obtaining fine segmentation masks.

Referring Expression Image Segmentation. Referring expression image segmentation has attracted increasing interest from researchers [9, 10, 18] in recent years. Beyond referring expression visual localization and semantic segmentation, referring expression image segmentation aims at generating fine segmentation masks from natural language queries. Hu et al. [9, 10] combined the features of the natural language query and each image region to determine whether the image region belongs to the referred object. Liu et al. [18] further developed referring expression image segmentation techniques: instead of directly using the feature of the whole query, they concatenated the features of each word and each image region, and then used a multimodal LSTM to integrate these concatenated features. However, on one hand, these methods ignore that each word in a query makes a different contribution to the segmentation. On the other hand, many queries require comparing multiple image regions, while these methods tackle each image region separately. In contrast to previous methods, we propose a key-word-aware network, which extracts key words to suppress the noise in queries, and models key-word-aware visual context among multiple image regions to better localize and recognize objects.

Fig. 2.

Our proposed key-word-aware network (KWAN) consists of four parts: (a) a CNN and an RNN that encode the features of every image region and every word in the natural language query, (b) a query attention model that extracts key words for each image region and uses the extracted key words to weight the original query, (c) a key-word-aware visual context model that models the visual context based on the corresponding key words, and (d) a prediction model that predicts the segmentation result based on the visual features, the key-word-aware visual context features and the key-word-weighted query features.

3 Proposed Method

Overview. Given an image and a natural language query, our goal is to segment out the object referred by the query from the image. To this end, we propose a key-word-aware network (KWAN), which is composed of four parts as illustrated in Fig. 2. The first part is a feature extractor, which encodes features of the image and query. The second part is a query attention model, which extracts key words for each image region and leverages these key words to weight the query feature. The third part is a key-word-aware visual context model, which models the visual context among multiple image regions based on the natural language query. The fourth part is a prediction model, which generates segmentation predictions based on the image features, the key-word-weighted query features and the key-word-aware visual context features. Below, we detail each part.

3.1 Image and Query Feature Extractor

The inputs of the referring expression image segmentation task contain two parts: an image \(I \in R^{H \times W \times C_{im}}\) and a natural language query \(X \in R^{C_{text} \times T }\), where H and W are the height and width of the image, respectively; \(C_{im}\) is the number of image channels; T denotes the number of words in the query; and each word is represented by a \(C_{text}\)-dimensional one-hot vector. We first use a convolutional neural network (CNN) to extract a feature map of the input image as follows:

$$\begin{aligned} \begin{aligned} F&= CNN(I) \\&= \{f_{1}, f_{2}, ..., f_{hw}\} \end{aligned} \end{aligned}$$
(1)

where \(F \in R^{h \times w \times C_{f}}\) is the extracted feature map; h and w are the height and width of feature map, respectively; and \(C_{f}\) is the feature dimension. In the feature map F, each feature vector \(f_{i} \in R^{C_{f}}\) encodes the appearance and semantic information of the i-th image region.
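To make this step concrete, the following is a minimal PyTorch sketch of Eq. (1); it is not the authors' Caffe implementation, and the torchvision VGG16 backbone, the input size and the tensor names are illustrative assumptions.

```python
# Hypothetical sketch of Eq. (1): extract a feature map F whose rows f_i
# describe the image regions. VGG16 here is a stand-in for the paper's CNN.
import torch
import torchvision

backbone = torchvision.models.vgg16(weights=None).features  # convolutional layers only
image = torch.randn(1, 3, 320, 320)                         # dummy input image I

fmap = backbone(image)                                      # (1, C_f, h, w)
_, C_f, h, w = fmap.shape
F = fmap.flatten(2).squeeze(0).t()                          # (h*w, C_f): one f_i per region
```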

Since the referring expression image segmentation task also needs spatial position information, we extract a position feature from the spatial coordinates of the i-th image region:

$$\begin{aligned} p_{i} = [x_{i}, y_{i}] \end{aligned}$$
(2)

where \(p_{i} \in R^{2}\) is the position feature of the i-th image region, formed by concatenating the normalized horizontal and vertical coordinates \(x_{i}\) and \(y_{i}\). The operator \([\cdot , \cdot ]\) represents the concatenation of features. Therefore, the final visual feature of the i-th image region can be obtained as follows:

$$\begin{aligned} v_{i} = [f_{i}, p_{i}] \end{aligned}$$
(3)

where \(v_{i} \in R^{C_{v}}\) is a \(C_{v}\)-dimensional visual feature vector of the i-th image region, and \(C_{v} = C_{f} + 2\). The visual feature contains the appearance, semantic and spatial position information of the image region.
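Continuing the sketch above, Eqs. (2) and (3) reduce to building a normalized coordinate grid and concatenating it to the region features; the grid construction below is an illustrative assumption.

```python
# Sketch of Eqs. (2)-(3): p_i = [x_i, y_i] with coordinates normalized to [0, 1],
# and v_i = [f_i, p_i]. F, h and w come from the CNN sketch above.
ys, xs = torch.meshgrid(torch.linspace(0, 1, h),
                        torch.linspace(0, 1, w), indexing="ij")
P = torch.stack([xs.flatten(), ys.flatten()], dim=1)  # (h*w, 2) position features
V = torch.cat([F, P], dim=1)                          # (h*w, C_f + 2) visual features v_i
```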

We use a recurrent neural network (RNN) to encode the feature of natural language query X as follows:

$$\begin{aligned} \begin{aligned} Q&= RNN(W_{e}X) \\&= \{q_{1}, q_{2}, ..., q_{T}\} \end{aligned} \end{aligned}$$
(4)

where \(Q \in R^{C_{q} \times T}\) is the encoded feature matrix of the query X, in which each feature vector \(q_{t} \in R^{C_{q}}\) encodes the textual semantic and contextual information for the t-th word. \(W_{e} \in R^{C_{e} \times C_{text}}\) is a word embedding matrix to reduce the dimensionality of the word features.
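A corresponding sketch of Eq. (4), again in PyTorch rather than the authors' Caffe code: the vocabulary size C_text, the embedding size C_e and the use of word indices instead of explicit one-hot vectors are illustrative assumptions; an LSTM is used because it is the paper's RNN of choice.

```python
# Sketch of Eq. (4): embed the query words (W_e) and encode them with an LSTM,
# producing one feature q_t per word.
import torch.nn as nn

C_text, C_e, C_q, T = 8000, 300, 1000, 20      # illustrative sizes
embed = nn.Embedding(C_text, C_e)              # plays the role of W_e
lstm = nn.LSTM(C_e, C_q, batch_first=True)

word_ids = torch.randint(0, C_text, (1, T))    # query X as word indices
Q, _ = lstm(embed(word_ids))                   # (1, T, C_q): rows are q_t
```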

3.2 Query Attention Model

After the feature encoding, we then extract key words by a query attention model. For the i-th image region, the query attention can be captured as follows:

$$\begin{aligned} z_{i,t} = w_{z}^{T}tanh(W_{q}q_{t} + W_{v}v_{i}) \end{aligned}$$
(5)
$$\begin{aligned} \alpha _{i,t} = \frac{exp(z_{i,t})}{\sum _{r=1}^T exp(z_{i,r})} \end{aligned}$$
(6)

where \(W_{q} \in R^{C_{z} \times C_{q}}\), \(W_{v} \in R^{C_{z} \times C_{v}}\) and \(w_{z} \in R^{C_{z}}\) are parameters of the query attention model; \(\alpha _{i,t} \in [0,1]\) is the query attention score of the t-th word for the i-th image region, and \(\sum _{t=1}^T \alpha _{i,t} = 1\). A high score \(\alpha _{i,t}\) means that the t-th word is important for the i-th image region, i.e., word t is a key word for image region i.

Based on the learned query attention scores, the query feature can be weighted as follows:

$$\begin{aligned} \hat{q}_{i} = \sum _{t=1}^T \alpha _{i,t}q_{t} \end{aligned}$$
(7)

where \(\hat{q}_{i} \in R^{C_{q}}\) is the weighted query feature for the i-th image region. In the weighted query feature, words are no longer equally important; key words make larger contributions.
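The query attention of Eqs. (5)-(7) can be written compactly for all regions at once. The sketch below continues the previous ones; the hidden size C_z and the parameter names are illustrative assumptions, not the authors' settings.

```python
# Sketch of Eqs. (5)-(7): score every (region, word) pair, softmax over words,
# and form the key-word-weighted query feature q_hat_i for each region.
C_z, C_v = 512, C_f + 2
W_q = nn.Linear(C_q, C_z, bias=False)
W_v = nn.Linear(C_v, C_z, bias=False)
w_z = nn.Linear(C_z, 1, bias=False)

q = Q.squeeze(0)                                                           # (T, C_q)
z = w_z(torch.tanh(W_q(q)[None, :, :] + W_v(V)[:, None, :])).squeeze(-1)   # (h*w, T)
alpha = torch.softmax(z, dim=1)                                            # rows sum to 1
q_hat = alpha @ q                                                          # (h*w, C_q)
```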

3.3 Key-Word-Aware Visual Context Model

The key-word-aware visual context model learns the context among multiple image regions for the natural language query. Towards this goal, we first aggregate the visual messages of image regions for each key word:

$$\begin{aligned} m_{t} = \left\{ \begin{aligned}&\frac{\sum _{i=1}^{hw} v_{i} u(\alpha _{i,t} - Thr)}{\sum _{i=1}^{hw} u(\alpha _{i,t} - Thr)} ,&\max _{i=1,...,hw}(\alpha _{i,t}) \ge Thr \\&\mathbf {0} ,&otherwise \end{aligned} \right. \end{aligned}$$
(8)

where \(m_{t} \in R^{C_{v}}\) is the aggregated visual feature vector, and \(u(\cdot )\) represents a unit step function. Thr is a threshold for selecting key words: \(\alpha _{i,t} \ge Thr\) implies that the t-th word is a key word for the i-th image region. If the t-th word is a key word for at least one image region (i.e., \(\max _{i=1,...,hw}(\alpha _{i,t}) \ge Thr\)), we average the visual features of the image regions that take this word as a key word. Otherwise, the t-th word is a non-key word for the whole image, and the aggregated visual feature \(m_{t}\) is \(\mathbf {0}\). The threshold Thr is set to 1/T, since \(\sum _{t=1}^T \alpha _{i,t} = 1\).
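Eq. (8) amounts to a masked average over regions. The following sketch continues the previous ones; the denominator clamp is only a guard for the empty case, where m_t is zero anyway because no region selects the word.

```python
# Sketch of Eq. (8): for each word t, average the visual features v_i of the
# regions whose attention on t reaches the threshold Thr = 1/T.
Thr = 1.0 / T
key = (alpha >= Thr).float()                            # (h*w, T) unit-step indicator
counts = key.sum(dim=0)                                 # how many regions select each word
M = (key.t() @ V) / counts.clamp(min=1).unsqueeze(1)    # (T, C_v): rows are m_t
```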

Based on the aggregated visual messages, we then use a fully-connected layer to learn visual context:

$$\begin{aligned} g_{t} = ReLU(W_{g}m_{t}+b_{g}) \end{aligned}$$
(9)

where \(g_{t}\in R^{C_{g}}\) is the learned visual context feature specific to the t-th word, \(W_{g}\in R^{C_{g} \times C_{v}}\) and \(b_{g} \in R^{C_{g}}\) are the parameters in the fully-connected layer, and ReLU denotes the rectified linear unit activation function.

Finally, we fuse the visual context features specific to each key word into one specific to the whole query as follows:

$$\begin{aligned} c_{i} = \sum _{t=1}^{T} g_{t}u(\alpha _{i,t} - Thr) \end{aligned}$$
(10)

where \(c_{i} \in R^{C_{g}}\) is the fused visual context feature specific to the query for the i-th image region.
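Eqs. (9) and (10) then become a single linear layer followed by a masked sum; the sketch below continues the previous ones, with the context dimension C_g chosen for illustration.

```python
# Sketch of Eqs. (9)-(10): map each m_t to a context feature g_t, then sum the
# g_t of a region's key words to get its query-specific context c_i.
C_g = 500
fc_g = nn.Linear(C_v, C_g)          # W_g, b_g
G = torch.relu(fc_g(M))             # (T, C_g): rows are g_t
C_ctx = key @ G                     # (h*w, C_g): c_i = sum_t u(alpha_it - Thr) * g_t
```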

3.4 Prediction Model and Loss Function

Once we extract the visual feature \(v_{i}\), the key-word-weighted query feature \(\hat{q}_{i}\) and the key-word-aware visual context feature \(c_{i}\), a correlation score between the query and each image region can be obtained as follows:

$$\begin{aligned} s_{i} = sigmoid(MLP([\hat{q}_{i}, v_{i}, c_{i}])) \end{aligned}$$
(11)

where MLP denotes a multi-layer perceptron, and the sigmoid function is used to normalize the score. \(s_{i} \in (0,1)\) is the normalized correlation score between the i-th image region and the natural language query. A high correlation score means that the image region is highly correlated with the query, i.e., the image region belongs to the referred foreground object.
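As a minimal sketch of Eq. (11), continuing the code above (the MLP width is an illustrative assumption):

```python
# Sketch of Eq. (11): score each region from [q_hat_i, v_i, c_i] and reshape
# the scores into an h x w label map.
mlp = nn.Sequential(nn.Linear(C_q + C_v + C_g, 500), nn.ReLU(), nn.Linear(500, 1))
s = torch.sigmoid(mlp(torch.cat([q_hat, V, C_ctx], dim=1))).view(h, w)
```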

The scores of all image regions together form a label map. We upsample the label map to the original image size as the segmentation result. A pixel-wise cross entropy loss is used to constrain the training:

$$\begin{aligned} \begin{aligned} Loss =&-\frac{1}{N} \sum _{n=1}^N \frac{1}{H^{(n)}W^{(n)}}\sum _{j=1}^{H^{(n)}W^{(n)}} [ y_{j}^{(n)} \times log s_{j}^{(n)} \\&+\,(1-y_{j}^{(n)}) \times log (1-s_{j}^{(n)})] \end{aligned} \end{aligned}$$
(12)

where N is the number of images in the training set; \(H^{(n)}\) and \(W^{(n)}\) are the height and width of the n-th image, respectively; \(s_{j}^{(n)}\) denotes the correlation score of the j-th pixel in the n-th image; and \(y_{j}^{(n)} \in \{0,1\}\) is the label indicating whether pixel j belongs to the referred object.
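For a single image, the loss in Eq. (12) reduces to upsampling the score map and applying pixel-wise binary cross entropy; the sketch below uses a dummy ground-truth mask and an assumed image size of 320x320.

```python
# Sketch of Eq. (12) for one image (the full loss averages over pixels and images).
import torch.nn.functional as Fnn

scores_up = Fnn.interpolate(s[None, None], size=(320, 320),
                            mode="bilinear", align_corners=False)[0, 0]
mask = torch.randint(0, 2, (320, 320)).float()     # dummy ground-truth labels y_j
loss = Fnn.binary_cross_entropy(scores_up, mask)
```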

4 Experiments

We conduct experiments to evaluate our method on two challenging referring expression image segmentation datasets: the ReferItGame dataset and the Google-Ref dataset. Objective and subjective results are reported in this section.

Evaluation Metrics. We adopt two typical image segmentation metrics: the intersection-over-union (IoU) and the precision (Pr). The IoU is the ratio between the intersection and union areas of the segmentation result and the ground truth. The precision is the percentage of correctly segmented objects in the whole dataset, where an object is counted as correctly segmented if its IoU exceeds a pre-set threshold. We use five thresholds in the experiments: 0.5, 0.6, 0.7, 0.8 and 0.9, and the corresponding precisions are denoted by Pr@0.5, Pr@0.6, Pr@0.7, Pr@0.8 and Pr@0.9, respectively.
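A simple sketch of how these two metrics can be computed from binary masks (NumPy arrays are an illustrative choice of representation):

```python
# IoU of one prediction/ground-truth pair, and Pr@thr over a list of IoUs.
import numpy as np

def iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def precision_at(ious, thr):
    return float(np.mean([x >= thr for x in ious]))   # e.g. thr = 0.5 gives Pr@0.5
```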

Implementation Details. The proposed method can be implemented with any CNN and RNN. Since state-of-the-art methods [9, 18] often choose VGG16 [30] or Deeplab101 [2] as their CNN and use an LSTM [8] as their RNN, we also implement the proposed method with these networks to allow a fair comparison. The dimensions of the CNN and RNN features are both set to 1000 (i.e., \(C_{f}=C_{q}=1000\)). The maximum number T of words in a query is 20, so the key word threshold Thr in the key-word-aware visual context model is set to 0.05 (i.e., 1/T). We train the proposed method in two stages. The first stage is low-resolution training, in which the predictions are not upsampled and the loss is calculated with the down-sampled ground truth. The second stage is high-resolution training, in which the predictions are upsampled to the original image size. The model is trained with Adaptive Moment Estimation (Adam) in both stages, with the learning rate set to 0.0001. We initialize the CNN with weights pre-trained on the ImageNet dataset [29], and initialize the other parts with random weights. All experiments are conducted with the Caffe [12] toolbox on a single Nvidia GTX Titan X GPU with 12 GB memory.
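For reference, a minimal sketch of the optimizer setup described above, assuming the PyTorch modules from the sketches in Sect. 3 (the original experiments use Caffe, so this is only illustrative):

```python
# Adam with learning rate 1e-4, as stated above; the two training stages differ
# only in whether predictions are upsampled before the loss, not in the optimizer.
modules = (backbone, embed, lstm, W_q, W_v, w_z, fc_g, mlp)
params = [p for m in modules for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)
```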

Table 1. Comparison with state-of-the-art methods on the ReferItGame testing.

4.1 Results on ReferItGame Dataset

The ReferItGame dataset [13] is a public dataset with 20000 natural images and 130525 natural language expressions. A total of 96654 foreground regions are referred to by these expressions, covering not only objects but also stuff, such as snow, mountains and so on. The dataset is split into training, validation and testing sets, containing 9000, 1000 and 10000 images, respectively. Similar to [9, 18], we use the training and validation sets for training and the testing set for testing.

The results are summarized in Table 1. None of the methods use additional training data or post-processing such as CRF. The state-of-the-art methods in [9, 18] treat every word in the natural language expressions equally and do not take the visual context into account. It can be observed from Table 1 that our proposed method outperforms these methods in terms of both IoU and precision, whether implemented with VGG16 or Deeplab101. Moreover, under the precision metric, the improvement of our method grows with higher thresholds. This superior performance demonstrates the effectiveness of selectively extracting key words for every image region and modeling the key-word-aware visual context.

Fig. 3.

Referring expression image segmentation results on the ReferItGame testing set. Left to right: input images, ground truth, the segmentation results from [9, 18] and our method, respectively. All methods are implemented with Deeplab101. In the query expressions, the black words are the key words our method predicted for the foreground regions (red regions). (Color figure online)

We depict some subjective referring expression image segmentation results on the ReferItGame dataset in Fig. 3. From the first and third images in Fig. 3, it can be seen that existing methods do not segment out some objects well when the query expression is too long or contains some noise, such as round brackets. Our method selects key words and filters out useless information in the query, and can therefore successfully segment out the referred objects in these images. Moreover, it can be observed that previous methods localize and segment some desired objects incorrectly when the query requires comparing multiple objects, such as in the second and fourth images in Fig. 3. A major reason is that previous methods ignore the visual context among objects. Our method generates better segmentation results by modeling the key-word-aware visual context.

Table 2. Comparison with state-of-the-art methods on the Google-Ref validation.
Fig. 4.

Referring expression image segmentation results on the Google-Ref validation set. Left to right: input images, ground truth, the segmentation results from [9, 18] and our method, respectively. All methods are implemented with Deeplab101. In the query expressions, the black words are the key words our method predicted for the foreground regions (red regions). (Color figure online)

4.2 Results on Google-Ref Dataset

The Google-Ref dataset [24] contains 26711 natural images with 54822 objects extracted from the MS COCO dataset [17]. There are 104560 expressions referring to these objects, and the average expression length is longer than that in the ReferItGame dataset. We use the split from [24], which chose 44822 and 5000 objects for training and validation, respectively.

The objective and subjective results are shown in Table 2 and Fig. 4, respectively. From Table 2, it can be seen that our method outperforms previous methods under both metrics, IoU and precision, which demonstrates the effectiveness of our method. From Fig. 4, it can be observed that previous methods fail to segment some objects when the queries are too long, such as in the first and second images in Fig. 4. In addition, previous methods find the wrong object instances when the queries require comparing different instances of the same class, such as in the third and fourth images in Fig. 4. The proposed method successfully segments out these objects, benefiting from the key word extraction and the key-word-aware visual context.

4.3 Discussion

Ablation Study. To verify the effectiveness of each part in our method, a number of ablation studies are conducted on the ReferItGame dataset. We compare five different models as follows:

  1. Baseline: We take the method in [9] as the baseline model, which classifies each image region with the whole query feature and does not model visual context.

  2. Key-word-model: Instead of using the whole query, we extract key words for every image region, but visual context is not used in this model.

  3. Context-model: We extract key words for every image region and leverage spatial pyramid pooling to model visual context, which is based only on the visual information.

  4. Full-model: The full model extracts key words for every image region and models key-word-aware visual context, which is based not only on vision but also on the natural language query.

  5. Soft-model: The soft model also extracts key words and models key-word-aware visual context. In this model, we use a soft attention model to aggregate the context instead of the unit step function described in Sect. 3.3.

Fig. 5.

Visualized results of the ablation studies on the ReferItGame testing. Left to right: input images, ground truth, the segmentation results from baseline model [9], key-word-model, context-model and full-model, respectively. All models are implemented with VGG16.

Table 3. Comparison of different ablation models on the ReferItGame testing. “Soft” means that the key-word-aware visual context is calculated by a soft attention model instead of the unit step function. All models are implemented with VGG16.

The results of the ablation studies are shown in Table 3. It can be seen that (1) using key words is better than using the whole query; (2) visual context is effective in improving the performance; (3) compared with context based only on vision, the key-word-aware visual context further improves the referring expression image segmentation performance; (4) the performance of the soft-attention-based model is comparable with that of the unit-step-function-based model. However, the computation cost of soft attention is much higher than that of the unit step function, so we use the unit step function instead of soft attention.

We visualize some results of the different ablation models in Fig. 5. It can be observed that the baseline model predicts almost no foreground object regions for some queries, because it fails to mine the semantics of these query expressions. The key-word-model mines key words from the queries and thus generates some foreground predictions. However, it still cannot segment out the referred objects, because it classifies each image region separately, while these queries require comparing multiple regions. The context-model improves the segmentation results by modeling visual context among image regions, but it also fails to segment out these objects. A major reason is that the context-model ignores the relationship between the visual context and the natural language queries. Our full-model extracts key words and models key-word-aware visual context, and therefore successfully segments out these objects.

Fig. 6.

Visualization of key words for some image regions on the ReferItGame testing. Left to right: input images, key words (black words) for image regions (red, green and blue points), and segmentation results from our full model implemented with VGG16 (Color figure online).

Table 4. IoU for queries of different lengths on the ReferItGame testing. All methods are implemented with VGG16.

Key Word. Tables 4 and 5 show the segmentation performance for queries of different lengths. It can be observed that, compared with existing methods, the proposed method yields larger gains when dealing with longer queries. This demonstrates that using key words instead of whole queries is effective, especially when tackling long queries. Figure 6 depicts visualized examples of extracted key words for some image regions. For example, in the second image in Fig. 6, according to the word cap alone, the green regions can be eliminated from the desired foreground object, because they are not caps.

Table 5. IoU for different length queries on Google-Ref validation. All methods are implemented with VGG16.
Fig. 7.

Failure cases on the ReferItGame dataset. Left to right: input images, ground truth, correlation score maps and segmentation results from our method implemented with VGG16.

Failure Case. Some failure cases are shown in Fig. 7. One type of failure occurs when queries contain low-frequency or new words. For example, in the first image in Fig. 7, blanket rarely appears in the training data. As a result, our method does not segment out the blanket, although it has already highlighted the right white regions in the background. Another case is that our method sometimes fails to segment out small objects. For instance, in the second image in Fig. 7, our method highlights the left of the background, but does not segment out the person, because it is very small. This problem may be alleviated by enlarging the scale of the input images.

5 Conclusion

This paper has presented a key-word-aware network (KWAN) for referring expression image segmentation. KWAN extracts key words with a query attention model to suppress the noise in the query and highlight the desired objects. Moreover, a key-word-aware visual context model is used to learn the relationships among multiple visual objects based on the natural language query, which is important for localizing and recognizing objects. Our method outperforms state-of-the-art methods on two common referring expression image segmentation datasets. In the future, we plan to improve the capacity of the network to tackle objects of different sizes.