
1 Introduction

Referring expression comprehension (REC) [17] aims to locate a specific object within a scene given a natural language expression. It is a fundamental problem in human-computer interaction and a bridge between computer vision and natural language processing. Although referring expression comprehension has achieved tremendous progress, most of today's REC models ignore the scene text in images. However, scene text is often indispensable and a more natural cue for distinguishing objects. Consider the situation in Fig. 1: it is difficult to identify the target man using basic visual attributes, since the players wear the same uniform and their positions constantly change during a football match. With the guidance of scene text, however, the target man can be detected easily and naturally.

Scene text is ubiquitous in our society and conveys rich information for understanding the visual scene [27]. As the COCO-Text dataset [36] suggests, about 50% of the images in large-scale datasets such as MS COCO [24] contain scene text, and the percentage increases sharply in urban environments. To move toward human-oriented referring expression comprehension, it is necessary to integrate scene text into existing REC pipelines. Scene text provides more discriminative information, so the target object can be specified more easily. For example, "get a bottle of Coca-Cola from the fridge" is both more precise for a robot searching for the target object and more user-friendly. In the literature, many studies have successfully used scene text for vision-language tasks, e.g., visual question answering [34], image captioning [33], cross-modal retrieval [27, 37], and fine-grained image classification [16]. Therefore, explicitly utilizing scene text is a natural step toward a more reasonable REC model.

Fig. 1. This paper introduces a novel dataset to study integrating scene text in the referring expression comprehension task. For the above example, the scene text “8” provides crucial information that naturally distinguishes different players.

To study how to comprehend scene text associated with objects in images, we collect a new dataset named TextREC. It contains 24,352 referring expressions and 36,083 scene text instances on 8,690 images, and most of the referring expressions are related to scene text. Our TextREC dataset challenges a model to recognize scene text, relate it to the referring expression, and choose the most relevant visual object, requiring semantic and visual reasoning between multiple scene text tokens and visual entities. We also evaluate several state-of-the-art REC models and observe limited performance because they ignore the scene text contained in images. To this end, we propose a Text-guided Adaptive Modular Network (TAMN) to address this issue. The contributions of this paper are threefold:

  • We introduce a novel dataset (TextREC) in which most of the referring expressions are related to scene text. Our TextREC dataset requires a model to leverage the additional modality provided by scene text so that the relationship between the visual objects in images and the textual referring expressions can be identified properly.

  • We propose a text-guided adaptive modular network (TAMN) to utilize scene text, relate it to the referring expressions, and select the most relevant visual object.

  • Substantial experimental results on the TextREC dataset demonstrate that it is important and meaningful to take scene text into account when locating the target object, and they also demonstrate the excellent performance of our TAMN on this task.

2 Related Work

2.1 Referring Expression Comprehension Datasets

To tackle the REC task, numerous datasets [3, 25, 28, 39, 42, 45] have been constructed. The first large-scale REC dataset was introduced by Kazemzadeh et al. [17]; it was collected by applying a two-player game named ReferIt Game to the ImageCLEF IAPR [8] dataset. Unlike ReferIt Game, RefCOCOg [28] was collected in a non-interactive setting based on the MSCOCO [24] images. RefCOCO [45] and RefCOCO+ [45] were also collected using ReferIt Game on the MSCOCO images. Due to the non-interactive setting, the referring expressions in RefCOCOg are longer and more complex than those in RefCOCO and RefCOCO+. The above datasets are collected from real-world images, while Liu et al. [25] use synthesized images and carefully designed templates to generate referring expressions, resulting in a synthetic dataset named CLEVR-Ref+. Wang et al. [39] point out that commonsense knowledge is important for identifying objects in daily life, and collect a dataset based on Visual Genome [18], named KB-Ref, in which answering each referring expression requires at least one piece of commonsense knowledge. Chen et al. [3] and Yang et al. [42] adopt the expression templates and scene graphs provided in [11, 18] to generate referring expressions for real-world images. Recently, Bu et al. [2] collected a dataset based on various image sources to highlight the importance of scene text.

2.2 Vision-Language Tasks with Text Reading Ability

With the maturity of scene text reading (OCR) [6, 19,20,21,22,23, 26, 31, 32, 41, 46], vision-language tasks with text reading ability have become an active research field. Several existing datasets [1, 29, 30, 34, 35, 40] study the task of visual question answering with text reading ability; they require understanding the scene text in the image when answering the questions. Similarly, to enhance scene text comprehension in an image, a new task named image captioning with reading comprehension and a corresponding dataset called TextCaps [33] have been proposed.

Existing works [7, 10, 15, 33, 34, 38, 47] propose various network architectures to utilize scene text information. LoRRA [34] adds an OCR attention branch to a VQA model [13] to select an answer either from a fixed vocabulary or from detected OCR tokens. M4C [10] utilizes a multi-modal Transformer encoder to jointly encode the question, image, and scene text, then generates answers through a dynamic pointer network. M4C-Captioner [33] directly removes the question input from the aforementioned M4C model to solve the text-based image captioning task. SA-M4C [15] proposes a spatially aware self-attention layer that ensures each input focuses on a local context rather than dispersing attention amongst all other entities, as in a standard self-attention layer. MM-GNN [7] utilizes graph neural networks to build three separate graphs for different modalities; the designed aggregators then use multi-modal contexts to obtain better representations for downstream VQA. SSBaseline [47] designs three simple attention blocks to suppress irrelevant features. LSTM-R [38] models the geometrical relationships between OCR tokens through a relation-aware pointer network.

3 TextREC Dataset

Our dataset enables referring expression comprehension models to conduct spatial, semantic, and visual reasoning between multiple scene text tokens and visual objects. In this section, we describe the process of constructing our TextREC dataset. We first describe how the images used in TextREC are selected, then explain the pipeline for collecting referring expressions related to scene text, and finally provide statistics and an analysis of the dataset.

Fig. 2. Number of annotated instances per category.

3.1 Images

In order to make full use of the annotations of existing datasets, we rely on the MSCOCO 2014 train images (Creative Commons Attribution 4.0 License). Since the goal of our dataset is to integrate scene text into existing REC pipelines, we are most interested in images that contain scene text. To select such images, we use COCO-Text [36], a scene text detection and recognition dataset built on MSCOCO, and keep images containing at least one non-empty, legible scene text instance. By visualizing the resulting images, we noticed that some scene text instances are too small and difficult to recognize, so we further filter out scene text instances with an area smaller than 100 pixels. This filtering results in 10,752 images, which form the basis of our TextREC dataset.
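The selection step above can be summarized in a short script. This is a hypothetical sketch: the field names (`anns`, `legibility`, `utf8_string`, `area`, `image_id`) follow our understanding of the COCO-Text JSON format, and the helper name is our own rather than the authors' code.

```python
import json
from collections import defaultdict

def select_images(cocotext_json_path, min_area=100):
    """Keep images that contain at least one legible, non-empty, large-enough scene text."""
    with open(cocotext_json_path) as f:
        cocotext = json.load(f)

    texts_per_image = defaultdict(list)
    for ann in cocotext["anns"].values():
        legible = ann.get("legibility") == "legible"
        non_empty = bool(ann.get("utf8_string", "").strip())
        large_enough = ann.get("area", 0) >= min_area
        if legible and non_empty and large_enough:
            texts_per_image[ann["image_id"]].append(ann)

    # Images with no surviving scene text instance are simply never added.
    return dict(texts_per_image)
```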

Fig. 3. Word cloud visualization of the most frequent scene text tokens contained in the referring expressions.

3.2 Referring Expressions

In the second stage, we collect referring expressions for objects in the above images. Different from the traditional referring expression comprehension task, in most cases the target object can be uniquely specified by scene text. For example, to ground the No. 13 player in a football match, the number 13 on the player's clothes is sufficient. Therefore, in the referring expressions we include scene text as much as possible and ignore appearance and location information. As a result, we choose simple templates to generate referring expressions. We obtain the bounding box of each object from the MSCOCO annotations and find the scene text instances contained in that bounding box. For each selected scene text instance, we generate referring expressions using two templates: “The object with <OCR string> on it” and “The <category name> with <OCR string> on it”, where <OCR string> is replaced by the scene text instance in the image and <category name> is replaced by the category name of the object. However, an expression generated through these templates may not actually refer to the corresponding object, because a scene text instance can be contained in the object's bounding box yet be irrelevant to the object. As shown in Fig. 7, the scene text instances are contained in the red bounding boxes but are irrelevant to the corresponding objects. To address this issue, we developed an annotation tool using Tkinter to check the plausibility of each referring expression. Finally, we manually retain 48,704 valid referring expressions out of 61,000 candidates.
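For illustration, the template-based generation described above can be sketched as follows. The helper names and the annotation fields (`bbox`, `utf8_string`, `category_name`) are assumptions; in the actual pipeline the generated expressions are further verified with the Tkinter annotation tool.

```python
def text_inside_box(text_bbox, obj_bbox):
    """Check whether a scene text bbox [x, y, w, h] lies inside an object bbox."""
    tx, ty, tw, th = text_bbox
    ox, oy, ow, oh = obj_bbox
    return tx >= ox and ty >= oy and tx + tw <= ox + ow and ty + th <= oy + oh

def generate_expressions(obj, scene_texts):
    """Yield candidate expressions for one object; they are verified manually afterwards."""
    for text in scene_texts:
        if text_inside_box(text["bbox"], obj["bbox"]):
            ocr = text["utf8_string"]
            yield f"The object with {ocr} on it"                    # template 1
            yield f"The {obj['category_name']} with {ocr} on it"    # template 2
```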

3.3 Statistics and Analysis

Our TextREC dataset contains 8,690 images, 36,083 scene text instances, 10,450 annotated bounding boxes belonging to 50 categories and 48,704 referring expressions (each template has 24,352 referring expressions). We also compare our TextREC dataset with standard benchmarks in the referring expression comprehension task. As shown in Table 1, our dataset is the only benchmark containing both scene text related expressions and scene text annotations.

We also analyze the number of annotated instances per category to see which categories are most likely to contain scene text. The top-20 categories and their corresponding instance numbers are shown in Fig. 2. The person category is the most likely to contain scene text, which is not surprising since people often wear clothing with logos such as “nike” or “adidas”. Vehicle categories also tend to contain scene text: buses often display their routes with characters, and airplanes display the airlines they belong to.

Table 1. Comparison between standard benchmarks and the proposed TextREC.

Moreover, we visualize a word cloud of the scene text tokens contained in the referring expressions. As shown in Fig. 3, most scene text tokens are meaningful. The most frequent word is “stop” since one MSCOCO category is stop sign. The second most frequent word is “police” because police vehicles appear frequently in our dataset.

4 Method

In this section, we introduce our Text-guided Adaptive Modular Network (TAMN), which aligns referring expressions with scene text. The overall framework is shown in Fig. 4. Given a referring expression r and a candidate object \(o_i\) as input, where i indexes the i-th object in the image, we first use the language attention network to parse the expression into the subject module and the text-guided matching module. The text-guided matching module then calculates a matching score between \(o_i\) and the weighted referring expression r. Finally, we combine this matching score with the score from the subject module proposed in MAttNet [44]; the overall matching score between \(o_i\) and r is the weighted combination of the two.

Fig. 4. Our model learns to parse an expression into the subject module and the text-guided matching module using the language attention network, then computes an individual matching score for each module. For simplicity, we refer to the text-guided matching module as the OCR module.

4.1 Language Attention Network

Similar to CMN [9] and MAttNet [44], we utilize a soft attention mechanism over the word sequence to attend to the relevant words automatically. As shown in Fig. 5, given an expression of T words \(r = \left\{ m_{t}\right\} _{t=1}^{T}\), we first embed each word \(m_{t}\) into a vector \(e_{t}\) using a one-hot word embedding. A bi-directional LSTM is then applied to encode the context of each word. To obtain the final representation of each word, we concatenate its hidden states in both directions:

$$\begin{aligned} \overrightarrow{h}_{t}&=\overrightarrow{\text{ LSTM }}\left( e_{t}, \overrightarrow{h}_{t-1}\right) \\ \overleftarrow{h}_{t}&=\overleftarrow{\text{ LSTM }}\left( e_{t}, \overleftarrow{h}_{t+1}\right) \\ h_{t}&=\left[ \overrightarrow{h}_{t}, \overleftarrow{h}_{t}\right] \end{aligned}$$

The attention weight over each word \(m_t\) for the text-guided matching module is obtained through a learned linear prediction over \(h_t\) followed by a softmax function:

$$\begin{aligned} a_{t}=\frac{\exp \left( \text{ FC }(h_{t}) \right) }{\sum _{k=1}^{T} \exp \left( \text{ FC }(h_{k}) \right) } \end{aligned}$$

The language representation of the text-guided matching module is obtained by the weighted sum of word embeddings:

$$\begin{aligned} \begin{aligned} q^{ocr}&=\sum _{t=1}^T a_{t}e_t\\ \end{aligned} \end{aligned}$$

Finally, we feed the concatenated initial and final hidden states \([h_0, h_T]\) into another fully-connected layer followed by a softmax to obtain the weights \(w_{ocr}\) and \(w_{subj}\) for our text-guided matching module and subject module:

$$\begin{aligned}{}[w_{ocr}, w_{subj}] = \text{ softmax }(\text{ FC } ( [h_0, h_T] )) \end{aligned}$$
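A compact PyTorch sketch of the language attention network follows, using the 512-dimensional embeddings and hidden states reported in Sect. 5.2. It is an illustrative reimplementation under our reading of the equations above, not the authors' released code; the module and variable names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAttention(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.bilstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.word_attn = nn.Linear(2 * dim, 1)   # scores each h_t for the attention a_t
        self.module_fc = nn.Linear(4 * dim, 2)   # produces [w_ocr, w_subj] from [h_0, h_T]

    def forward(self, tokens):                   # tokens: (B, T) word indices
        e = self.embed(tokens)                   # (B, T, dim) word embeddings e_t
        h, _ = self.bilstm(e)                    # (B, T, 2*dim), h_t = [forward; backward]
        a = F.softmax(self.word_attn(h).squeeze(-1), dim=1)        # attention weights a_t
        q_ocr = torch.bmm(a.unsqueeze(1), e).squeeze(1)            # weighted sum of e_t
        w = F.softmax(self.module_fc(torch.cat([h[:, 0], h[:, -1]], dim=-1)), dim=-1)
        return q_ocr, a, w[:, 0], w[:, 1]        # q_ocr, a_t, w_ocr, w_subj
```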
Fig. 5. The illustration of the language attention network.

4.2 Text-Guided Matching Module

Our text-guided matching module is illustrated in Fig. 6. Given a candidate object \(o_i\) and all the ground truth scene text instances \(\left\{ p_{n}\right\} _{n=1}^{N}\) contained in the bounding box of \(o_i\), we first encode each scene text instance \(p_{n}\) into a vector using the same word embedding layer as the language attention network:

$$\begin{aligned} \begin{aligned} u_n&= \text{ embedding }(p_n) \\ \end{aligned} \end{aligned}$$

Then we compute the cosine similarity between each word embedding of the scene text instance and \(q^{ocr}\):

$$\begin{aligned} \begin{aligned} S(u_{n}, q^{ocr}) =\frac{u^{T}_{n}q^{ocr}}{||u_{n}||||q^{ocr}||} \\ \end{aligned} \end{aligned}$$

The similarity score between \(\left\{ u_{n}\right\} _{n=1}^{N}\) and \(q^{ocr}\) can be obtained by choosing the largest score in \(\left\{ S (u_{n}, q^{ocr}) \right\} _{n=1}^{N}\):

$$\begin{aligned} \begin{aligned} S (u, q^{ocr}) = \max \limits _{1 \le n \le N} S (u_n, q^{ocr})\\ \end{aligned} \end{aligned}$$

This score alone is not sufficient as the matching score between \(o_i\) and r. As shown in Fig. 7, a scene text instance may lie within the bounding boxes of two different objects but relate to only one of them (green box). If we used \(S (u, q^{ocr})\) directly as the matching score, the unrelated object (red box) could be mistakenly matched with the expression. To address this problem, the model should learn the association between scene text and objects; for example, “NIKE” is unlikely to appear on a motorcycle but can appear on a person. We therefore add a confidence score to \(S ( u, q^{ocr} )\):

$$\begin{aligned} \begin{aligned} S ( f_{obj}, q^{ocr} ) =\frac{f^{T}_{obj}q^{ocr}}{||f_{obj}||||q^{ocr}||} \\ \end{aligned} \end{aligned}$$
(1)

where \(f_{obj}\) is the visual representation of the candidate object extracted in the subject module. This confidence score can drive the model to learn the association between the scene text and object.

The final matching score of our text-guided matching module can be obtained by multiplying \(S ( u, q^{ocr} )\) with its confidence score:

$$\begin{aligned} \begin{aligned} S(o_i|q^{ocr}) = S ( f_{obj}, q^{ocr} ) S ( u, q^{ocr} )\\ \end{aligned} \end{aligned}$$
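Putting the three equations of this subsection together, the text-guided matching module can be sketched as below. The linear projection that maps the RoI feature \(f_{obj}\) into the 512-d expression space before computing Eq. 1 is our assumption; everything else mirrors the cosine similarity, the max over scene text instances, and the final product.

```python
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedMatching(nn.Module):
    def __init__(self, visual_dim, dim=512):
        super().__init__()
        self.proj = nn.Linear(visual_dim, dim)   # assumed projection of f_obj into q_ocr space

    def forward(self, q_ocr, text_embs, f_obj):
        # q_ocr: (B, dim); text_embs: (B, N, dim) embeddings u_n of the scene texts
        # inside the candidate box; f_obj: (B, visual_dim) RoI feature of the candidate.
        q = q_ocr.unsqueeze(1).expand_as(text_embs)
        s_text = F.cosine_similarity(text_embs, q, dim=-1).max(dim=1).values  # max_n S(u_n, q_ocr)
        conf = F.cosine_similarity(self.proj(f_obj), q_ocr, dim=-1)           # Eq. 1
        return conf * s_text                                                  # S(o_i | q_ocr)
```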
Fig. 6. The illustration of the proposed text-guided matching module. “conf” refers to the confidence score calculated in Eq. 1.

4.3 Learning Objective

Assume we obtain \(S(o_i|q^{ocr})\) and \(S(o_i|q^{subj})\) from our proposed text-guided matching module and from the subject module proposed in MAttNet [44], respectively, together with the module weights \(w_{ocr}\) and \(w_{subj}\) from the language attention network. The overall matching score for candidate object \(o_i\) and referring expression r is:

$$\begin{aligned} \begin{aligned} S(o_i|r) = w_{ocr} S(o_i|q^{ocr}) + w_{subj}S(o_i|q^{subj}) \end{aligned} \end{aligned}$$

Inspired by the triplet loss used in image retrieval, for each positive pair \((o_i, r_i)\) we randomly sample two negative pairs \((o_i, r_j)\) and \((o_k, r_i)\), where \(r_j\) is an expression matched with another object in the same image as \(o_i\), and \(o_k\) is another object in the same image as \(r_i\). The combined hinge loss is calculated as follows:

$$\begin{aligned} \begin{aligned} L^{overall}_{rank}=\sum _i&\lambda _1 [\delta +S(o_i|r_j)-S(o_i|r_i)]_{+}\\ + \sum _i&\lambda _2 [\delta +S(o_k|r_i)-S(o_i|r_i)]_{+} \end{aligned} \end{aligned}$$

where \(\delta \) is a margin hyper-parameter and \([\cdot ]_{+}=\max (\cdot , 0)\). To stabilize the training procedure, we further add a hinge loss to the text-guided matching module:

$$\begin{aligned} \begin{aligned} L^{ocr}_{rank}=\sum _i&\lambda _3 [\delta +S(o_i|q^{ocr}_j)-S(o_i|q^{ocr}_i)]_{+}\\ + \sum _i&\lambda _4 [\delta +S(o_k|q^{ocr}_i)-S(o_i|q^{ocr}_i)]_{+} \end{aligned} \end{aligned}$$

The final loss function is summarized as follows:

$$\begin{aligned} \begin{aligned} L=L_{rank}^{ocr}+L_{rank}^{overall} \end{aligned} \end{aligned}$$
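Since the overall and OCR-module ranking losses share the same hinge form (with all \(\lambda _i = 1\), as in Sect. 5.2), a single helper suffices. The sketch below assumes the three scores of each triplet have already been computed; the margin value \(\delta = 0.1\) is illustrative, as the paper does not state it.

```python
import torch

def hinge_rank_loss(s_pos, s_neg_expr, s_neg_obj, delta=0.1):
    """[delta + S(neg expr) - S(pos)]_+ + [delta + S(neg obj) - S(pos)]_+, summed over the batch."""
    zero = torch.zeros_like(s_pos)
    return (torch.max(zero, delta + s_neg_expr - s_pos)
            + torch.max(zero, delta + s_neg_obj - s_pos)).sum()

def total_loss(overall_scores, ocr_scores, delta=0.1):
    # Each argument is a tuple (s_pos, s_neg_expr, s_neg_obj) of batched score tensors,
    # computed with the overall score S(o|r) and the OCR-module score S(o|q_ocr) respectively.
    return hinge_rank_loss(*overall_scores, delta) + hinge_rank_loss(*ocr_scores, delta)
```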
Fig. 7. The motivation for adding the confidence score in our OCR module.

5 Experiment

In this section, we first introduce the experiment setting. Then we evaluate the TAMN and several state-of-the-art REC methods on our TextREC dataset. Furthermore, we conduct ablation studies to demonstrate the effectiveness of each component in our TAMN. We also explore more templates and a new test setting. Finally, the attention weights for each word in the referring expressions are visualized to demonstrate the effectiveness of the language attention network.

5.1 Dataset and Evaluation Protocol

We evaluate our text-guided adaptive modular network on the TextREC dataset. From Fig. 2, it can be observed that the categories in the dataset follow a long-tailed distribution. To ensure that the test set contains rare categories, we split the dataset according to each category's share of the total number of instances, resulting in train and test splits with 7,422 and 1,268 images, respectively.

Following the standard evaluation setting [28], we compute the Intersection over Union (IoU) between the ground truth and the predicted bounding box. We regard a detection as a true positive if its IoU is greater than 0.5; otherwise it is a false positive. For each image, we then compute the precision@1 measure according to the confidence score. The final performance is obtained by averaging these scores over all images.
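The protocol can be expressed in a few lines. This is a minimal sketch assuming [x, y, w, h] box coordinates and one ground truth box per expression; the helper names are our own.

```python
def iou(box_a, box_b):
    """IoU of two boxes in [x, y, w, h] format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def precision_at_1(predictions, ground_truths, iou_thresh=0.5):
    """predictions: per-image list of (box, confidence); ground_truths: per-image gt box."""
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        best_box = max(preds, key=lambda p: p[1])[0]   # highest-confidence prediction
        hits += iou(best_box, gt) > iou_thresh
    return hits / len(ground_truths)
```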

5.2 Implementation Details

The detection model we adopt is Mask R-CNN, following the same implementation as MAttNet [44]. The detection model is trained on the union of MSCOCO's 80k train images and a 35k subset of val images (trainval35k), excluding the test images in our TextREC dataset. We use ground truth bounding boxes during training; at test time, we use the Mask R-CNN mentioned above to generate boxes. Our model is optimized with the Adam optimizer with a batch size of 15 and an initial learning rate of 0.0004. The model is trained for 50 epochs, and the learning rate is decayed by a factor of 2 every 16 epochs. The word embedding size and the hidden state size of the bi-LSTM are set to 512, as is the word embedding size for the scene text. The outputs of all fully-connected layers in our model are 512-dimensional. For the hyper-parameters in the loss functions, we set \(\lambda _1 = \lambda _2 = 1\) in \(L_{rank}^{overall}\) and \(\lambda _3 = \lambda _4 = 1\) in \(L_{rank}^{ocr}\).
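The optimization schedule above corresponds to the following PyTorch sketch, where `model` and `train_loader` are placeholders for the full TAMN and its data pipeline rather than actual released components.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
# Decay the learning rate by a factor of 2 every 16 epochs, over 50 epochs in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=16, gamma=0.5)

for epoch in range(50):
    for batch in train_loader:          # batch size 15, as stated above
        optimizer.zero_grad()
        loss = model(batch)             # assumed to return L = L_rank^ocr + L_rank^overall
        loss.backward()
        optimizer.step()
    scheduler.step()
```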

Table 2. Performance of the baselines on our TextREC dataset. TAMN significantly benefits from scene text input and achieves the highest precision@1 (%) score, suggesting that it is important to integrate scene text for the referring expression comprehension task.

5.3 Performance of the Baselines on TextREC Dataset

To illustrate the gap between traditional REC datasets and our TextREC dataset, we conduct experiments with different state-of-the-art REC methods. As shown in Table 2, current state-of-the-art methods [4, 14, 43, 44] fall short on our TextREC dataset. The results indicate that these methods ignore scene text in images, whereas our TAMN obtains inspiring results by integrating scene text. This clearly verifies that it is important and meaningful to take scene text into account for the referring expression comprehension task.

Fig. 8. The visualization results of the word attention in the language attention network.

5.4 Ablation Studies

The Subject Module and OCR Module. As shown in Fig. 4, our TAMN consists of two modules: the subject module and the OCR module. We test the performance of each module alone; the results are shown in Table 3. Compared with using only the subject module, adding our OCR module gives improvements of 26.4 and 20.2 on template1 and template2, respectively. Cooperating with our OCR module, the subject module gives improvements of 1.5 and 2.7 on template1 and template2, respectively. These results verify the effectiveness of both modules. Moreover, for our TAMN, we compute the contribution of each module's score to the overall score over the whole test set. On template1, our OCR module makes the dominant contribution (97.1%) to the overall score, while the contribution of the subject module (2.9%) is negligible. When the expression form switches to template2, the contribution of our OCR module decreases from 97.1% to 70.0% and that of the subject module increases from 2.9% to 30.0%, yet the OCR module still accounts for the majority. The reason is that scene text provides more information than the object category in most cases. These results clearly demonstrate the effectiveness of our proposed OCR module.

Table 3. Ablation studies on different modules in our framework. The precision@1 (%) is reported.
Table 4. Ablation studies on different OCR systems. “GT” denotes using ground truth scene text annotations.

The Confidence Score in Our OCR Module. As shown in Fig. 6, we add a confidence score by calculating the similarity between the RoI feature of the candidate object and the scene text embedding. To verify its effectiveness, we conduct the ablation experiments shown in Table 5. When only the OCR module is used, adding the confidence score yields improvements of 1.8 and 3.3 on template1 and template2, respectively. We also test the confidence score within our whole framework, where it yields improvements of 1.9 and 1.3 on template1 and template2. These results clearly verify the effectiveness of the confidence score.

Table 5. Ablation studies on the confidence score in our OCR module. The precision@1 (%) is reported.

Different OCR Systems. We conduct ablation studies to measure the performance with different OCR systems. The results in Table 4 show that the quality of the scene text detection and recognition method has a great impact on the final results. EasyOCR performs better because its text spotting precision is 6.8% higher than that of PaddleOCR.

Templates in Different Forms. We conduct ablation studies on templates in different forms. As shown in Table 6, the performance is very close across different templates as long as they contain the same amount of information (<category name> or <OCR string>). For example, in rows 1, 3, and 5, the performance differences are within 0.3 in terms of the precision@1 measure; similarly, in rows 2 and 4, the differences are also within 0.3.

Table 6. Ablation studies on the templates in different forms. The precision@1 (%) is reported.

New Test Setting. In traditional referring expression comprehension datasets, a referring expression has exactly one corresponding bounding box in an image. In our TextREC dataset, however, one referring expression can have multiple corresponding bounding boxes. For example, for the expression “The object with ‘police’ on it”, there can be more than one police car in the image, and it is necessary to find all the objects that match the description. Therefore, we propose a new test setting that calculates precision, recall, and F1 score. This is done by setting a threshold on the confidence of all detected bounding boxes; we use 0.75 for template1 and 0.35 for template2 due to their different score distributions. The selected boxes are then matched against the ground truth bounding boxes to obtain true positives, false positives, and false negatives. We evaluate our TAMN under this new setting; the results are shown in Table 7. We believe this setting offers a more comprehensive evaluation of the models.
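A hedged sketch of this metric computation is given below. The greedy matching of kept boxes to ground truth boxes is our assumption about the procedure; it reuses the `iou` helper from the evaluation sketch in Sect. 5.1.

```python
def precision_recall_f1(preds, gt_boxes, conf_thresh, iou_thresh=0.5):
    """preds: list of (box, confidence) for one expression; gt_boxes: all matching gt boxes."""
    kept = [box for box, conf in preds if conf >= conf_thresh]   # 0.75 / 0.35 per template
    matched, tp = set(), 0
    for box in kept:
        for j, gt in enumerate(gt_boxes):
            if j not in matched and iou(box, gt) > iou_thresh:
                matched.add(j)
                tp += 1
                break
    fp, fn = len(kept) - tp, len(gt_boxes) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```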

Table 7. The performance of our TAMN in the new test setting. The precision, recall and F1-Score (%) are reported.

5.5 Visualization Analysis

To verify the effectiveness of the language attention network, we visualize the attention weight of each word in the referring expressions. As shown in Fig. 8, both the subject module and the OCR module focus on the scene text in template1. When the expression form switches to template2, the OCR module still focuses on the scene text, while the subject module shifts its focus to the category name. For example, in the sentence “the object with ‘15’ on it”, the subject module focuses on “15”, whereas it focuses on “person” in the sentence “the person with ‘19’ on it”. This is reasonable since the scene text is the only discriminative information in template1.

6 Conclusion

In this paper, we point out that most existing REC models ignore scene text, which is naturally and frequently employed to refer to objects. To address this issue, we construct a new dataset termed TextREC, which studies how to comprehend the scene text associated with objects in an image. We also propose a text-guided adaptive modular network (TAMN) that explicitly utilizes scene text, relates it to the referring expressions, and chooses the most relevant visual object. Experimental results on the TextREC dataset show that current state-of-the-art REC methods fail to achieve the expected results, while our TAMN achieves excellent results. The ablation studies also show that it is important to take scene text into account for the referring expression comprehension task.