
1 Introduction

Images are a very convenient medium for storing and presenting visual information. Hence, how to query and obtain wanted images from giant image datasets is an attractive research topic both academically and industrially [2, 19, 20]. However, since images are a kind of unstructured information, image retrieval has never been an easy task. Furthermore, in the past few decades, accompanied by the rapid development of image capturing and collecting techniques, the number of available images has grown astonishingly. Hence, how to effectively acquire wanted images from a tremendous number of candidates is attracting more and more attention.

Eyes are among the most important facial visual features, and image retrieval based on eyes is a long-standing research topic. It has wide practical value and application significance in fields such as public safety, blind date matching, and beauty retouching. However, due to their quite limited region area, the unique identity information that eyes hints can convey is seriously restricted, and hence, with this information source alone, accurate image retrieval is hardly achievable. To deal with this problem, in this paper we design a multimodal image retrieval method based on eyes hints combined with facial description properties. In our approach, both image and text information are unified as the query source, so more effective identity information can be utilized to guide the search procedure. On the other hand, the introduction of description properties not only directly improves the accuracy of image retrieval, but also introduces more personality and customization. Users can perform customized retrieval by altering the textual facial descriptions, as shown in Fig. 1. It can be observed there that the image query procedure is customized by the yellow and green background facial description properties respectively.

Fig. 1.

A demonstration of image retrieval based on both eyes hints and facial description properties. Both images in the rightmost column are retrieval outputs: the top one is based on the far-left eyes image and the yellow and blue background facial description properties, while the bottom one is based on the same eyes image and the blue and green background properties. (Color figure online)

In recent years, there have been some research findings involving the combination of vision and text information [11]. However, most of them focus on visual question answering [4, 9], cross-modal retrieval, and image captioning [7, 27]. Works directly related to the multimodal image retrieval discussed in this paper are relatively rare.

In order to solve the mentioned problem, we propose a novel image and property information fusion mechanism. After obtaining the high-level semantic features of the eye images and the description properties through their respective encoders, we perform element-wise addition and concatenation between the two types of features, and then multiply the results element-wise to achieve effective multimodal information fusion. This proposed method is named the Product of Addition and Concatenation (or PAC for short). A detailed introduction is given in Sect. 3. The approach is evaluated on the publicly available CelebA dataset, and satisfactory performance is achieved.

To summarize, the main contribution of this paper is threefold:

  (1) We propose a multimodal image retrieval method based on eyes hints and facial description properties.

  (2) A novel cross-modal feature fusion mechanism is introduced which can effectively combine both vision and text information.

  (3) The multimodal information fusion capacity of deep neural networks is beneficially explored.

The rest of the paper is organized as follows: Sect. 2 reviews related work; the proposed PAC-based image retrieval method is presented in Sect. 3; validation experiments are shown in Sect. 4; Sect. 5 concludes the paper.

2 Related Work

Visual Question Answering (VQA). VQA is a typical application that combines both image and text information. Its goal is to automatically generate natural language answers according to the input image and natural language question. There are generally two ways to combine features in VQA. The first is direct combination, such as concatenation, element-wise multiplication, and element-wise addition [14]. Zhou et al. [28] used a bag-of-words representation for the questions and GoogLeNet to extract visual features, and then directly concatenated the two features. Agrawal et al. [1] fed the product of the two feature vectors into a multi-layer perceptron with two hidden layers. The second way is to use bilinear pooling or related schemes in a neural network framework [14]. Fukui et al. [8] proposed Multimodal Compact Bilinear (MCB) pooling to combine image and text features. Due to the high computational cost of MCB, the multimodal low-rank bilinear pooling strategy (MLB) was proposed, where the Hadamard product and linear mapping are used to approximate bilinear pooling [15].
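For reference, a minimal PyTorch sketch of these direct combination operators is given below; the tensor shapes and the helper name fuse_direct are illustrative and not taken from any of the cited works.

```python
import torch

def fuse_direct(img_feat: torch.Tensor, txt_feat: torch.Tensor, mode: str = "concat") -> torch.Tensor:
    """Direct feature combinations commonly used in VQA (illustrative sketch)."""
    if mode == "concat":
        # Concatenate along the channel dimension.
        return torch.cat([img_feat, txt_feat], dim=-1)
    if mode == "add":
        # Element-wise addition (features must have the same size).
        return img_feat + txt_feat
    if mode == "mul":
        # Element-wise (Hadamard) product, e.g. fed into an MLP classifier.
        return img_feat * txt_feat
    raise ValueError(f"unknown fusion mode: {mode}")

img = torch.randn(8, 512)   # batch of image features
txt = torch.randn(8, 512)   # batch of text features
fused = fuse_direct(img, txt, "mul")  # shape (8, 512)
```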

Cross-Modal Retrieval. Cross-modal retrieval uses a sample from one modality to search for samples of other modalities with similar semantics. Traditional methods generally learn a mapping matrix from the paired co-occurrence information of different modal sample pairs, and map the features of different modalities into a common semantic vector space. Li et al. [17] introduced a cross-modal factor analysis method. Rasiwasia et al. [13, 21] applied canonical correlation analysis (CCA) to cross-modal retrieval between text and images. In recent years, many studies have used deep learning to extract effective representations of different modalities at the bottom layers and to establish semantic associations between modalities at the top layers [6, 16]. Wei et al. [26] proposed an end-to-end deep canonical correlation analysis method to retrieve text and images. Gu et al. [10] employed Generative Adversarial Networks and Reinforcement Learning for cross-modal retrieval; in their work, the generation process is integrated into the cross-modal feature embedding, so that not only global features but also local features can be learned. Wang et al. [24] argued that previous methods rarely consider the interrelationship between image and text information when calculating similarity, so they proposed the Cross-modal Adaptive Message Passing (CAMP) method.

Metric Learning. The goal of metric learning is to maximize inter-class variations while minimizing intra-class variations, and it is quite common in pattern recognition applications. Among neural network based approaches, LeCun et al. [12] designed the contrastive loss to increase inter-class variations. Schroff et al. [22] proposed the triplet loss, and a large number of subsequent metric learning methods have been built on it, such as the quadruplet loss [5].

3 Method

As mentioned in the introduction, our goal is to achieve multimodal image retrieval based on both eyes hints and facial description properties. Here, how to effectively combine query information coming from distinct modalities is the most critical problem. Since both vision and text information are complex and comprehensive, we design a neural network based information fusion and processing strategy. The main training pipeline is shown in Fig. 2.

Fig. 2.

The training pipeline of our multimodal image retrieval method based on eyes hints and facial description properties.

Specifically, first, the query eyes image x is encoded by a Light CNN  [25], a light-weight, noise-robust network proposed for face recognition. The query eyes image is transformed into a spatial feature \(f_{\text {img}}({{\textit{\textbf{x}}}})=\phi _{{{\textit{\textbf{x}}}}}\in \mathbb {R}^{W\times H\times C}\), where W is the width, H is the height, and C = 512 is the number of feature channels. Note that we modify the size of the last fully connected layer of Light CNN from 256 to 512 so that the image and text features have the same number of channels. Second, we encode the facial description properties t with an LSTM  [28]. We define \(f_{\text {text}}({{\textit{\textbf{t}}}})=\phi _{{{\textit{\textbf{t}}}}}\in \mathbb {R}^{L\times S\times d}\) to be the hidden state at the final time step, where L is the sequence length, S is the batch size, and d = 512 is the hidden layer size. Finally, \(\phi _{{{\textit{\textbf{x}}}}}\) and \(\phi _{{{\textit{\textbf{t}}}}}\) are combined into \(\phi _{{{\textit{\textbf{xt}}}}}=f_{\text {combine}}(\phi _{{{\textit{\textbf{x}}}}}, \phi _{{{\textit{\textbf{t}}}}})\) with the proposed PAC method, which will be introduced in detail in Sect. 3.1.
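For concreteness, the following PyTorch sketch outlines the two encoding branches. The convolutional backbone is only a hypothetical stand-in for Light CNN (whose real architecture is given in [25]), and the vocabulary size, embedding dimension, and class names are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class EyeImageEncoder(nn.Module):
    """Stand-in for the Light CNN branch; in the paper the real Light CNN [25]
    is used, with its last fully connected layer resized from 256 to 512."""
    def __init__(self, out_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(128, out_dim)

    def forward(self, x):                 # x: (B, 3, H, W) eye crops
        return self.fc(self.backbone(x))  # (B, 512) image feature

class PropertyEncoder(nn.Module):
    """LSTM over the tokenized facial description properties; the hidden state
    at the final time step serves as the 512-d text feature."""
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, tokens):            # tokens: (B, L) integer ids
        _, (h_n, _) = self.lstm(self.embed(tokens))
        return h_n[-1]                    # (B, 512) text feature
```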

On the other hand, during image retrieval, we calculate the cosine similarity between the fused query feature and the features of the candidate images, and then sort the candidates to obtain the face images that best meet the query conditions.
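A minimal sketch of this retrieval step, assuming the fused query feature and the gallery features are plain tensors, might look as follows (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 5):
    """Rank candidate face images by cosine similarity to the fused query feature.

    query_feat:    (D,)   fused eyes+property feature
    gallery_feats: (N, D) features of the candidate face images
    Returns the indices of the top_k most similar candidates.
    """
    q = F.normalize(query_feat, dim=-1)
    g = F.normalize(gallery_feats, dim=-1)
    sims = g @ q                       # cosine similarities, shape (N,)
    return sims.topk(top_k).indices
```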

3.1 Feature Fusion by PAC

In order to effectively achieve multimodal information fusion, we explore a comprehensive combination strategy. Since element-wise addition and concatenation are the most common direct fusion operations, our first idea is to employ both operations jointly. However, because the dimensions of the two resulting features may differ, a co-dimensionalization step is introduced: a convolution operation is used to adjust the dimensionality of the concatenated feature matrix, and at the same time the sigmoid function is applied to prevent the values from becoming too large. After co-dimensionalization, the two combination results are fused again by element-wise multiplication to obtain the final fused feature. The whole procedure is named the Product of Addition and Concatenation. Specifically,

$$\begin{aligned} \phi _{{{\textit{\textbf{xt}}}}}=f_{\text {add}}(\phi _{{{\textit{\textbf{x}}}}}, \phi _{{{\textit{\textbf{t}}}}})\odot f_{\text {concat}}(\phi _{{{\textit{\textbf{x}}}}}, \phi _{{{\textit{\textbf{t}}}}}), \end{aligned}$$
(1)

where \(\odot \) is the element-wise product, \(\phi _{{{\textit{\textbf{x}}}}}\) denotes the image feature, \(\phi _{{{\textit{\textbf{t}}}}}\) the text feature, and \(\phi _{{{\textit{\textbf{xt}}}}}\) the fused feature.

$$\begin{aligned} f_{\text {add}}(\phi _{{{\textit{\textbf{x}}}}}, \phi _{{{\textit{\textbf{t}}}}})=W _{\text {img}}\phi _{{{\textit{\textbf{x}}}}}+W _{\text {text}}\phi _{{{\textit{\textbf{t}}}}}, \end{aligned}$$
(2)

where \(W _{\text {img}}\), \(W _{\text {text}}\) are learnable weights to balance both components.

$$\begin{aligned} f_{\text {concat}}(\phi _{{{\textit{\textbf{x}}}}}, \phi _{{{\textit{\textbf{t}}}}})=\sigma (W_{g}\circ [\phi _{{{\textit{\textbf{x}}}}}, \phi _{{{\textit{\textbf{t}}}}}]), \end{aligned}$$
(3)

where \([\phi _{{{\textit{\textbf{x}}}}}, \phi _{{{\textit{\textbf{t}}}}}]\) denotes matrix concatenation and \(\sigma \) denotes the sigmoid function. We define \(W_{g}\circ \) to be a series of operations: the concatenated matrix is first batch normalized, then passed through the ReLU activation function, and finally its number of channels is reduced from 1024 to 512 through a fully connected layer.
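Putting Eqs. (1)–(3) together, a possible PyTorch implementation of the PAC fusion is sketched below. It follows the paper's description of \(W_{g}\circ \) as BatchNorm, ReLU, and a 1024-to-512 fully connected layer, but it assumes one particular reading of Eq. (2), namely that \(W _{\text {img}}\) and \(W _{\text {text}}\) are learnable scalar weights; the paper does not spell this out.

```python
import torch
import torch.nn as nn

class PAC(nn.Module):
    """Product of Addition and Concatenation fusion, following Eqs. (1)-(3).
    Sketch only: W_img and W_text are taken as learnable scalars here."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.w_img = nn.Parameter(torch.tensor(1.0))
        self.w_text = nn.Parameter(torch.tensor(1.0))
        self.gate = nn.Sequential(
            nn.BatchNorm1d(2 * dim),   # batch normalization of the concatenation
            nn.ReLU(),
            nn.Linear(2 * dim, dim),   # reduce channels 1024 -> 512
        )

    def forward(self, phi_x: torch.Tensor, phi_t: torch.Tensor) -> torch.Tensor:
        f_add = self.w_img * phi_x + self.w_text * phi_t                      # Eq. (2)
        f_cat = torch.sigmoid(self.gate(torch.cat([phi_x, phi_t], dim=-1)))   # Eq. (3)
        return f_add * f_cat                                                  # Eq. (1)
```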

3.2 Loss Function

Clearly, the goal of our training is to bring the fused features of faces with the same identity and state closer together, while pushing apart the features of distinct images. For this task, we employ a triplet loss, which involves anchors, positive samples, and negative samples. When selecting the triplets, we choose negative samples that have the same identity as the anchor but a different state, so that the network can distinguish differences in face states. We define this triplet as \(T_{\text {state}}(f(x_{i}^{a}),f(x_{i}^{p}),f(x_{i}^{n}))\), where \(x_{i}^{a}\) is the anchor, \(x_{i}^{p}\) is the positive sample, and \(x_{i}^{n}\) is the negative sample. f(x) is an embedding constrained to live on the d-dimensional hypersphere  [22], where \(d=512\), i.e. \(\left\| f(x) \right\| _{2}=1\). Similarly, we define \(T_{\text {identity}}(f(x_{j}^{a}),f(x_{j}^{p}),f(x_{j}^{n}))\) to denote a triplet whose negative sample and anchor have different identities but a similar state. We then use the following triplet loss:

$$\begin{aligned} \begin{aligned} L&= \sum _{i}^{N_{\text {state}}}\left[ \left\| f(x_{i}^{a})-f(x_{i}^{p}) \right\| _{2}^{2}-\left\| f(x_{i}^{a})-f(x_{i}^{n}) \right\| _{2}^{2}+\alpha \right] _{+} \\&+ \sum _{j}^{N_{\text {identity}}}\left[ \left\| f(x_{j}^{a})-f(x_{j}^{p}) \right\| _{2}^{2}-\left\| f(x_{j}^{a})-f(x_{j}^{n}) \right\| _{2}^{2}+\alpha \right] _{+}, \end{aligned} \end{aligned}$$
(4)

where \(\alpha \) is a margin that is enforced between positive and negative pairs, \(N_{\text {state}}\) is the number of \(T_{\text {state}}\) triplets, and \(N_{\text {identity}}\) is the number of \(T_{\text {identity}}\) triplets. We believe that splitting the loss into two parts, state and identity, helps the network combine the two types of data.
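A compact PyTorch sketch of Eq. (4) is given below; the margin value and the way the triplets are batched are placeholders, since the paper does not specify them.

```python
import torch
import torch.nn.functional as F

def triplet_term(anchor, positive, negative, margin):
    """One summand family of Eq. (4) over a batch of triplets.
    Inputs are L2-normalized embeddings of shape (B, 512)."""
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    return F.relu(d_pos - d_neg + margin).sum()   # hinge [.]_+ then sum

def pac_loss(state_triplet, identity_triplet, margin=0.2):
    # state_triplet / identity_triplet: (anchor, positive, negative) tuples of
    # fused features; margin=0.2 is a placeholder, not a value from the paper.
    return (triplet_term(*state_triplet, margin)
            + triplet_term(*identity_triplet, margin))
```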

4 Experiments

In this section, the proposed multimodal image retrieval approach based on eyes hints and facial description properties is evaluated both quantitatively and qualitatively.

4.1 Experiment Configurations

Datasets. The experiments are conducted on the widely used face attribute dataset CelebA  [18], which contains 202,599 images of 10,177 celebrity identities, each annotated with 40 attribute tags. Our experiments utilize the 35 attributes that are not related to the eyes. As shown in Fig. 1, the query eyes images are cropped as a single rectangle. We adopt the same subset division introduced in [18], where 40,000 images constitute the testing set and all the remaining images form the training set.

Implementation Details. The experiments are implemented in PyTorch. Training runs for 210k iterations with an initial learning rate of 0.01.

Evaluation Metrics. For performance evaluation, we use the most commonly used metric R@K (Recall at K), defined as the proportion of queries whose correct match appears in the top-K retrieved results  [24]. Specifically, we report R@1, R@5, R@10, R@50 and R@100.
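For reference, R@K can be computed from the ranked retrieval lists roughly as follows (a sketch; the tensor layouts are assumptions):

```python
import torch

def recall_at_k(ranked_indices: torch.Tensor, gt_indices: torch.Tensor, k: int) -> float:
    """R@K: fraction of queries whose ground-truth image appears in the top-k results.

    ranked_indices: (Q, N) gallery indices sorted by decreasing similarity per query
    gt_indices:     (Q,)   index of each query's ground-truth gallery image
    """
    hits = (ranked_indices[:, :k] == gt_indices.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```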

Table 1. Image retrieval performance on the CelebA dataset.

4.2 Quantitative Results

In order to objectively evaluate the performance of our method, several classical information fusion approaches are used for comparison. MLB is a classic multimodal fusion method in the VQA field, based on the Hadamard product [15]. MUTAN is a fusion method based on tensor decomposition, also applied in VQA  [3]. The TIRG method decomposes the multimodal features into a gating part and a residual part, where the gating connection uses the input image features as the reference for the output composite features, and the residual connection represents the modifications in the feature space  [23].
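For orientation, an MLB-style baseline can be sketched roughly as below: each modality is linearly projected, passed through a nonlinearity, and the two projections are combined by a Hadamard product. This is a simplified sketch, not the exact configuration used in [15] or in our comparison experiments.

```python
import torch
import torch.nn as nn

class MLBFusion(nn.Module):
    """Rough sketch of MLB-style fusion [15]: project both modalities,
    apply a nonlinearity, and take the Hadamard product (details simplified)."""
    def __init__(self, img_dim=512, txt_dim=512, joint_dim=512):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, joint_dim)
        self.proj_txt = nn.Linear(txt_dim, joint_dim)

    def forward(self, phi_x, phi_t):
        return torch.tanh(self.proj_img(phi_x)) * torch.tanh(self.proj_txt(phi_t))
```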

To compare fairly with our method, only the feature fusion part differs across the experiments, while all other components are exactly the same as ours.

Table 1 presents the detailed performance. It can be observed that our method clearly outperforms the other methods on every evaluation metric. One thing to be mentioned is that, since a facial description is not unique identity information while the ground truth image used as the evaluation benchmark is unique, the overall numbers may not look impressive. However, this is caused by the nature of the task rather than by the method adopted.

Fig. 3.

A few image retrieval outputs on the CelebA dataset. In the green dotted frame of the first column, the eyes image and the description are the query condition. The next five columns are the top five search results obtained by our method, where the ground truth image is surrounded by a solid green border. (Color figure online)

4.3 Qualitative Results

A few retrieved images are shown in Fig. 3. It can be observed that, generally, the top five candidates all conform to the query eyes and facial properties. Take the first row as an example: all five images are “Attractive, Big Nose, Black Hair, Heavy Makeup, High Cheekbones, Mouth Slightly Open, No Beard, Receding Hairline, Smiling, Wearing Earrings, Wearing Lipstick, Wearing Necklace, Young”, while their eyes are similar to the query eyes to some extent.

Fig. 4.

A few “unsuccessful” retrieval outputs on the CelebA dataset. In the green dotted frame of the first column, the eyes image and the description are the query condition. The next five columns are the top five search results obtained by our method. The images surrounded by a solid green border in the rightmost column are the corresponding ground truth images. (Color figure online)

Figure 4 shows some “unsuccessful” retrievals, i.e., cases where the ground truth image is not among the top five returned images. It can be observed that, even in these “unsuccessful” retrievals, the obtained images are still compatible with the query eyes and description properties.

On the other hand, Fig. 5 shows retrieval outputs for the same eyes but distinct description properties. It can be seen that the text query condition directly influences the retrieval output. Hence, customized image retrieval is achievable with our method.

Fig. 5.

Image retrieval outputs with the same eyes but distinct description properties as input. The eyes image in the dotted frame in the first column is the query image, and the description is the query text. Note that the attribute words marked in red are unique. The next five columns are the top five search results obtained by our method. The images surrounded by a solid green border in the rightmost column are the corresponding ground truth images. (Color figure online)

In addition, we also calculate R@1 for each description property, as shown in Fig. 6; the average is 0.6874. It can be seen that “Male”, “No Beard”, and “Mouth Slightly Open” have the highest recall rates in our model, while “Wearing Necktie”, “Blurry”, and “Bald” have relatively lower recall rates. This is consistent with our intuition, because the latter properties are relatively rare in the candidate dataset.

Fig. 6.

The accuracy of identifying each property.

5 Conclusion

In this paper, we propose to use description properties as supplementary information to eye images in image retrieval tasks, and a novel multimodal information fusion method is developed for this purpose. The effectiveness of the proposed method is verified on a publicly available dataset, CelebA, and the performance generally matches our expectations. In addition, personalized and customized image retrieval is achievable with the proposed approach. In the near future, we would like to extend this method to more general image retrieval problems.