
1 Introduction

Given the explosion of image-based content shared digitally, there has been a corresponding research focus on developing object recognition, image recognition and its variants, and image segmentation tools and systems for understanding, identifying, storing, and querying images. The major research thrusts include automatic image captioning, image annotation (including relationship detection), and content-based image retrieval. Automatic image captioning attempts to generate natural language captions for images that represent the holistic meaning of the image. As the saying goes, ‘A picture is worth a thousand words’, and finding the right words is a non-trivial task. Image annotation focuses on object detection combined with relationship detection and grounding to identify canonical relationships within an image. Following [1], these are of the form <subject - predicate - object> . Finally, content-based image retrieval casts a wide net; we consider some recent work from [2], who focus on using automatic image annotation via scene graph grounding to retrieve images similar to a query scene graph.

1.1 Motivation

The motivation for this work comes from the need for holistic image retrieval systems. As [3] notes, text-based queries for image retrieval encode high levels of reasoning and abstraction about an image that are difficult to represent textually. In addition, text-based queries necessarily represent a semantic gap – a disconnect between human semantics and visual features. [2] demonstrates some drawbacks of current text-based image retrieval systems, namely the inability of such systems to abstract queries beyond index-based searches by recognizing the relationships between terms. We show this in Fig. 2, where common image search utilities return results that may not match the given query. We note that in the first search engine’s results, the second through fifth results feature a woman eating a cake, while in the second, none of the first three feature a girl eating a cake. Our results, however, return images that contain both features of the query where possible, and at least one otherwise (image 4 in our results).

1.2 Overview

An overview of the process is shown in Fig. 1. Given the query, we extract the canonical forms - what we deem to be the smallest query unit. In the Visual Genome database, this is a subgraph of the form <subject - predicate - object> , where each of the subject, predicate, and object is a node, with edges between them labeled with the type of relation (subject-to-predicate or predicate-to-object). We then generate approximates of the canonical forms to broaden our search. In this case, approximates of <girl - eating - cake> include <girl.n.01 - eat.v.01 - patty.n.01> and <girl.n.01 - eat.v.01 - cake.n.03> . Note that these approximates are represented as WordNet synsets, as they are used to search the Visual Genome database, whose scene graphs also represent nodes as WordNet synsets. We retrieve images and ground them to the query – a process during which we identify which portions of the query graph the image satisfies and to what degree – using phrase similarities derived from high dimensional word embeddings (we use LexVec [4] for its state-of-the-art performance on word similarity measurements). Finally, we score each image on its grounding and rank the results.

Fig. 1.

The user provides a query with two independent requests: girl - eating - cake and fork - on - plate . We generate queries that approximately match the given query using WordNet synsets ( girl.n.01 - eat.v.01 - patty.n.01, fork.n.01 - along.r.01 - plate.n.01 ). We then retrieve images that contain these subgraphs in their scene graphs and ground the subgraphs to our top-level natural language queries. We then map the retrieved synsets and the query nodes to the same word embedding space, and measure similarities to determine holistic image similarity. The image similarity scores are then used to rank the images.

Fig. 2.

We show the results from common search engines for the same query. Our results are also presented. We note in the first set of results that the first two images do not include cake - in fact, the woman is eating peas and peppers. A woman eating cake appears third. In the second set of results, a profile view of a woman eating cake appears third in the results. Our results show the first two results with a woman (approximated as girl, as our far smaller image database did not contain any images that exactly matched the query) eating cake, with a fork on the plate.

Our work partially bridges the semantic gap by leveraging the idea of a scene graph as presented in the Visual Genome dataset [1]. A key contribution of [1] is the formalization of a scene graph – each image in the Visual Genome Dataset contains a human-generated graph of nodes denoting subjects, objects, and predicates, with the edges denoting the relations between them. The graph annotates the relational content in the image. The dataset also contains region-level and image-level natural language captions, along with WordNet synsets for objects and relationships. There is a preponderance of research surrounding the scene-graph formalization – Relationship Detection [5], Dense Caption Generation [6], and Scene Graph Generation [7]. We propose leveraging existing capabilities showcased in the above works for an efficient and scalable image retrieval pipeline.

Our contributions include:

Query Approximation and Ranking.

We present a model for performing approximate search on a scene graph database. Given a query of the form <subject - predicate - object> (e.g. <girl - eating - cake> ), we generate approximate queries by searching for terms similar to the query terms and evaluating phrases for their similarity to the query. Using this, we can generate, for the given example, query approximates that include woman - eating - cake, person - eating - cake, girl - eating - patty , and woman - eating - patty . In addition, we present an algorithm for grounding and evaluating retrieved scene graphs against the query graph for holistic image ranking.

Aggregate Graph Representations.

We present several graph database representations that are used in generating plausible query approximates by reducing the search space to the current semantic context. Specifically, we show how to create three aggregate graph representations that each index a different query node type: (i) a Subject Aggregate Graph (SAG) for obtaining subject approximates, (ii) an Object Aggregate Graph (OAG) for obtaining object approximates, and (iii) a Predicate Aggregate Graph (PAG) for obtaining predicate approximates, all in the current semantic context.

1.3 The Image Retrieval Pipeline

Figures 1 and 3 summarize our approach. We now focus on the pipeline in Fig. 3. Given a natural language graph query (e.g. a small scene graph as considered by [2]), we reduce the query to its canonical forms (Sect. 4.1); we define a canonical form as the basic unit of query consisting of a subject, predicate, and object of the form <subject - predicate - object> . We then generate approximate queries using WordNet synsets; we retrieve the set of candidate images that contain these approximates and ground the images’ scene graphs to the query. We then use word embeddings to project each canonical form into the vector space and score each image’s scene graph against the query graph, penalizing images with missing or inexact matches.

Fig. 3.

The user provides a natural language graph query. We convert this to the canonical form (the basic query unit of the form <subject - predicate - object> ) for each top-level query and generate the approximates. Plausible subject, object, and predicate approximates are generated using the aggregate graphs we have devised. These aggregate graphs – the Subject Aggregate Graph, the Object Aggregate Graph, and the Predicate Aggregate Graph – allow us to reduce the search space of subjects, objects, and predicates, respectively. Using the node approximates, we generate query approximates and retrieve the set of candidate images using an inverted index of the canonical queries from the Visual Genome dataset. We then ground the images to determine how well each image represents our natural language query and rank them accordingly.

Three sample queries of varying complexity are presented in Fig. 4. Our approach handles both simple and complex queries. Our grounding ensures images that most closely match the entire query are ranked higher than images that offer a partial match. In addition, we bridge the ‘semantic gap’ by searching for approximates. This allows us to return images that closely match the provided query in situations where an exact match may not be possible; an example is shown in Fig. 5: there is no image that exactly matches the query; however, our approach approximates woman into girl, and finds images that contain most of the query: a girl eating cake, with the cake on a plate, and a fork on the plate as well. While the highest ranked image in the figure does contain cake with frosting, this is a happy coincidence; the scene graph itself does not contain the annotation, so we say the image mostly matches the query. We also show the second through fifth ranked images and their matched scene subgraphs.

Fig. 4.

(a) A simple query. This is also the canonical form described in Sect. 4.1. (b) A more complex query. The same noun (cake) is the object of two separate subjects (woman and frosting), each connected by a different predicate . (c) Two independent top-level queries for the same image, i.e. they do not share any nodes between them. This consists of a ‘simple’ top-level query and a ‘complex’ top-level query, as per the informal terms adopted in (a) and (b).

Fig. 5.

We show our system matching a complex query and returning images that most closely match all facets of the query.

2 Related Work

Content-Based Image Retrieval.

Content-based image retrieval involves retrieving and ranking images given either a text or image query. The former can be used to retrieve images that incorporate some of the query requirements, while the latter can be used to retrieve images relevant to or similar to the query – here evaluation is context-specific, e.g. whether images have similar objects, similar colors, similar poses, or whether the images are exactly the same. Some implementations include Google Reverse Image Search and TinEye. Our focus is on the former approach, as we feel text-based queries allow users more freedom in specifying images to retrieve. It is not always the case that a user has a sample of the image she would like, or can sketch a faithful representation of it, as in, e.g., sketch-based image retrieval. [2]’s key contribution is as follows: the authors develop a framework for generating accurate groundings of a query scene graph (either complete or partial) and use grounding likelihood to rank images for retrieval and display.

Semantic Similarity.

An integral aspect of our image retrieval pipeline is approximate query generation. We work with canonical relationship phrases of the form <subject - predicate - object> to generate query approximates. As the Visual Genome dataset maps each object to available WordNet synsets, we incorporate WordNet based semantic similarity measures. [8] proposes a domain-specific corpus-based training method to identify correct word sense and derives more accurate cosine-similarity measures between source and target words. [9] shows a simple baseline for WordNet synset similarity using vector embedding cosine similarity averages. [10] shows a sentence-based similarity measure that uses a TF-IDF analogue to compute similarities between a source word and its synset lemmas.

3 Graph Databases

As noted, [1] formalizes the scene graph – an image representation using human annotations on bounded regions that grounds objects, relationships, and attributes. Each object (usually a noun) is considered a node with a directed edge towards a relation (a predicate with a part-of-speech tag of verb, preposition, or action). The predicate may or may not have a directed edge towards another object instance. In addition, objects also have attributes (usually adjectives, but these may also include actions). We will henceforth consider each object node as a noun with two forms: subject or object . Note that object the graph node and object the grammatical role operate in separate domains; despite the potential for notational confusion, convention suggests these terms as the appropriate choice. A canonical form is a set of three nodes and two edges of the form <subject → predicate → object> : this triplet indicates a base query that we use as the smallest query unit (i.e. the canonical form).

3.1 Full Scene Graphs

We generate the full scene graph for each image using the scene_graphs from the Visual Genome dataset. The full scene graph is stored in a Neo4J database. Each full scene graph consists of at least one subgraph with at least one canonical triplet of the form <subject - predicate - object>. We note that there may be multiple independent subgraphs corresponding to different regions in the image.
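As an illustration of this loading step, the sketch below inserts one canonical triplet per annotated relationship into Neo4j using the official Python driver. The connection settings, node labels, relationship type names (SUBJ_TO_PRED, PRED_TO_OBJ), and the exact JSON field names are assumptions made for illustration based on the distributed scene_graphs data, not the authors' implementation.

```python
import json
from neo4j import GraphDatabase

# Placeholder connection details; adjust to the local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOAD_TRIPLET = """
MERGE (s:Entity    {synset: $subj, image_id: $img})
MERGE (p:Predicate {synset: $pred, image_id: $img})
MERGE (o:Entity    {synset: $obj,  image_id: $img})
MERGE (s)-[:SUBJ_TO_PRED]->(p)
MERGE (p)-[:PRED_TO_OBJ]->(o)
"""

def load_scene_graphs(path="scene_graphs.json"):
    """Insert one <subject - predicate - object> triplet per annotated relationship."""
    with open(path) as f:
        scene_graphs = json.load(f)
    with driver.session() as session:
        for sg in scene_graphs:
            objects = {o["object_id"]: o for o in sg["objects"]}
            for rel in sg["relationships"]:
                subj = (objects[rel["subject_id"]].get("synsets") or [None])[0]
                obj = (objects[rel["object_id"]].get("synsets") or [None])[0]
                pred = (rel.get("synsets") or [None])[0]
                if subj and pred and obj:
                    session.run(LOAD_TRIPLET, subj=subj, pred=pred, obj=obj,
                                img=sg["image_id"])
```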

3.2 Object Aggregate Graph

Our query approximation requires us to generate the candidate subjects, objects, and predicates for each top-level query. We develop a novel characterization of the aggregate graph introduced in [1] – we maintain a unique index of <subject → predicate> pairs, and for each pair, we maintain a unique list of objects that are associated with that subject → predicate pair. Objects are not unique across subgraphs in the aggregate graph: apple.n.01 may appear in multiple <subject → predicate> subgraphs; however, within a subgraph, each WordNet synset occurs once.

With the Object Aggregate Graph, we can, given a subject → predicate pair as well as several candidate objects , reduce the set of candidate objects to plausible candidates for that subject → predicate pair. Given candidate objects \( C_{O} \) and child objects \( K_{O} \) in the Object Aggregate Graph, we return \( C_{O} \cap K_{O} \). Figure 6 shows a subgraph within our Object Aggregate Graph.
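A minimal in-memory sketch of the Object Aggregate Graph and the reduction step follows; the class and method names are hypothetical, and the same pattern (with the key and value roles swapped) yields the Subject and Predicate Aggregate Graphs of Sects. 3.3 and 3.4.

```python
from collections import defaultdict

class ObjectAggregateGraph:
    """For every <subject -> predicate> pair seen in the database, keep the
    unique set of objects that ever appear under it."""
    def __init__(self):
        self.children = defaultdict(set)   # (subject, predicate) -> {objects}

    def add_triplet(self, subj, pred, obj):
        self.children[(subj, pred)].add(obj)

    def plausible_objects(self, subj, pred, candidate_objects):
        """Keep only the candidate objects actually observed under this pair."""
        return set(candidate_objects) & self.children.get((subj, pred), set())

# Usage with the example from Fig. 6 (synset names as in Visual Genome):
oag = ObjectAggregateGraph()
oag.add_triplet("fire_engine.n.01", "along.r.01", "street.n.01")
oag.add_triplet("fire_engine.n.01", "along.r.01", "sidewalk.n.01")
print(oag.plausible_objects("fire_engine.n.01", "along.r.01",
                            {"street.n.01", "road.n.01"}))
# -> {'street.n.01'}
```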

Fig. 6.

This subgraph from the Object Aggregate Graph shows all possible objects that a Fire Engine ( fire_engine.n.01 ) can be next to ( along.r.01 ). These include streets ( street.n.01 ), sidewalks ( sidewalk.n.01 ), and highways ( highway.n.01 ). Given a set of approximate synsets for street, as well as the fire_engine.n.01 → along.r.01 pair, we can identify the appropriate set of plausible objects using the process from Sect. 3.2.

3.3 Subject Aggregate Graph

To generate candidate subjects, we devise a Subject Aggregate Graph: we maintain a set of unique predicate → object pairs, and for each pair, we keep a unique set of subjects that are associated with that pair. So, given a predicate → object pair and candidate subjects \( C_{S} \), we reduce them to the plausible set of subjects by returning \( C_{S} \cap K_{S} \) (where \( K_{S} \) is the set of subjects associated with that pair in the Subject Aggregate Graph). We show this in Fig. 7.

Fig. 7.

This subgraph from the Subject Aggregate Graph shows all possible subjects that can eat ( eat.v.01 ) an apple ( apple.n.01 ). These include woman ( woman.n.01 ), child ( child.n.01 ), and girl ( girl.n.01 ).

3.4 Predicate Aggregate Graph

We adapt [1]’s aggregate graph here; however, we include all nouns and predicates from the scene graphs, instead of the top-k nouns and predicates. Given a set of candidate subjects and candidate objects , we can obtain the set of plausible predicates, and return the union of the candidate and plausible predicates. We show a subgraph in Fig. 8.

Fig. 8.

This partial predicate aggregate graph shows all possible nodes that are subjects and objects for the predicate dance ( dance.v.01 ). This allows us to determine plausible predicates given a subject-object pair.

4 Query Matching and Image Retrieval

4.1 Canonical Form

The canonical form of a query is a triplet (Fig. 4a) of the form <subject → predicate → object> . Given a graph query, we extract canonical forms from the query and operate on each independently, within scope of the query subgraph. Once again, we refer to Fig. 4c – this query consists of two independent subgraphs: one about a girl by a man, with the girl wearing a skirt and one about a truck on grass . We split the first query into <girl → wears → skirt> and <girl → by → man> . Similarly, we extract from truck on grass the triplet <truck → on → grass> .

We perform this triplet extraction under an independence assumption – that we can extract individual triplets, then combine them later to obtain the final rankings. This allows us to operate independently on each triplet and its set of query approximates.
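The sketch below illustrates this decomposition on an assumed graph-query representation (a dict of predicate nodes plus labeled edges); the format is ours, chosen only to make the extraction step concrete.

```python
from typing import Dict, List, NamedTuple

class Triplet(NamedTuple):
    subject: str
    predicate: str
    object: str

def extract_canonical_forms(query_graph: Dict) -> List[Triplet]:
    """Walk every predicate node of a query graph and emit one canonical
    <subject -> predicate -> object> triplet per predicate."""
    subj_of, obj_of = {}, {}            # predicate node id -> attached noun label
    for edge in query_graph["edges"]:
        if edge["type"] == "subject_to_predicate":
            subj_of[edge["predicate"]] = edge["node"]
        elif edge["type"] == "predicate_to_object":
            obj_of[edge["predicate"]] = edge["node"]
    return [Triplet(subj_of[p], label, obj_of[p])
            for p, label in query_graph["predicates"].items()
            if p in subj_of and p in obj_of]

# Example: "girl wears skirt, girl by man" decomposes into two triplets.
query = {
    "predicates": {0: "wears", 1: "by"},
    "edges": [
        {"type": "subject_to_predicate", "node": "girl",  "predicate": 0},
        {"type": "predicate_to_object",  "node": "skirt", "predicate": 0},
        {"type": "subject_to_predicate", "node": "girl",  "predicate": 1},
        {"type": "predicate_to_object",  "node": "man",   "predicate": 1},
    ],
}
print(extract_canonical_forms(query))
# [Triplet(subject='girl', predicate='wears', object='skirt'),
#  Triplet(subject='girl', predicate='by', object='man')]
```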

4.2 Approximate Generation and Retrieval

Given a triplet, we obtain the WordNet approximates for the subject, predicate, and object. There are three levels of approximates – obtaining the sister synsets, the child synsets (hyponymy), and the parent synsets (hypernymy). The sister synsets are obtained by performing a lookup in the WordNet database of our natural language node label (i.e. ‘girl’ or ‘skirt’ or ‘wears’). There are four scopes available for synset lookup: we can take (i) just the sister synsets, (ii) the sister and child synsets, (iii) the sister and parent synsets, or (iv) all three hierarchies: sister, child, and parent synsets. We limit our choice for candidate subjects and objects to the sister synsets and for candidate predicates to the sister and child synsets. This is to speed up computation time on our local machines; on parallelized clusters, such a limitation is not necessary, and we can use the complete closure of a synset: sisters, children, and parents.
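Using NLTK's WordNet interface, these lookup scopes can be sketched as follows (the helper name and flags are ours):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def node_approximates(label, pos, include_children=False, include_parents=False):
    """Sister synsets for a node label, optionally widened with hyponyms
    (children) and hypernyms (parents), mirroring the four scopes above."""
    sisters = wn.synsets(label, pos=pos)
    approx = set(sisters)
    if include_children:
        approx.update(h for s in sisters for h in s.hyponyms())
    if include_parents:
        approx.update(h for s in sisters for h in s.hypernyms())
    return approx

# Subjects/objects: sisters only; predicates: sisters + children (as in the text).
subject_approx = node_approximates("girl", wn.NOUN)
predicate_approx = node_approximates("eat", wn.VERB, include_children=True)
```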

Given the set of approximates for the subject and object, we obtain plausible predicates from the aggregate graph. The predicates are ranked against the provided predicate sister synsets: we measure the Wu & Palmer (WUP) similarity between each plausible predicate and the set of sister predicates, and take the average similarity score. Here, we take the top 2/3 of predicates as our set of predicate approximates. We choose WUP similarity as it weights synset edges by distance in the hierarchy, i.e. semantically similar predicates are ranked higher than synonymous predicates without a semantic relation. This allows us to narrow results to more appropriate relations by context.
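A sketch of this ranking step, assuming NLTK's wup_similarity and synset-name strings as used in the Visual Genome annotations (the 2/3 cut-off follows the text; the function name is ours):

```python
from nltk.corpus import wordnet as wn

def rank_plausible_predicates(plausible, sister_predicates, keep_fraction=2 / 3):
    """Score each plausible predicate (from the Predicate Aggregate Graph) by its
    average WUP similarity to the query's sister predicate synsets, and keep the
    top fraction. Both arguments are lists of synset names such as 'eat.v.01'."""
    def avg_wup(candidate):
        cand = wn.synset(candidate)
        scores = [cand.wup_similarity(wn.synset(s)) or 0.0 for s in sister_predicates]
        return sum(scores) / len(scores) if scores else 0.0

    ranked = sorted(plausible, key=avg_wup, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

# e.g. rank_plausible_predicates(["eat.v.01", "devour.v.01", "hold.v.01"],
#                                ["eat.v.01", "eat.v.02"])
```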

4.3 Synset Embeddings

For ranking, we prefer to use a Euclidean metric for measuring triplet distance between the query and approximates. However, the WordNet hierarchy does not operate on such a metric. We use a model similar to [9] – we determine the embeddings for each synset in our database following the baseline method of [9], but instead of the sum, we take the average of the lemma embeddings to obtain the synset centroid in the embedding space with

$$ v_{S} = \frac{1}{\left| L_{S} \right|}\sum\nolimits_{l \in L_{S}} v_{l} $$
(1)

where \( v_{S} \) is the embedding vector for a synset \( S \), \( L_{S} \) is the set of lemmas for the synset \( S \), and \( v_{l} \) is the embedding for each lemma \( l \in L_{S} \). We use LexVec [4] embeddings as the lookup table for \( v_{l} \), as LexVec has shown state-of-the-art performance in word similarity and analogy. We deal with compound words by averaging the embedding for the compound word itself and the embedding sum of its component words: given a compound word \( w = w_{1} w_{2} \cdots w_{n} \) (i.e. for \( w = \) christmas tree , \( w_{1} = \) christmas and \( w_{2} = \) tree ), we take \( v_{l}^{\prime} = \frac{1}{2}\left( v_{l} + \sum\nolimits_{i = 1}^{n} v_{l_{i}} \right) \). If the compound word itself does not exist in the lookup table, we take only the embedding sum of its component words.
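The following sketch implements Eq. (1) and the compound-word rule above; the embedding lookup is assumed to be a dict-like mapping from word to vector (e.g. loaded LexVec vectors), and the helper names are ours.

```python
import numpy as np
from nltk.corpus import wordnet as wn

def lemma_embedding(lemma, lexvec):
    """Compound-word rule: average the compound's own vector (if present) with
    the sum of its parts; otherwise fall back to the sum of the parts."""
    parts = lemma.split("_")                       # WordNet joins compounds with '_'
    part_vecs = [lexvec[p] for p in parts if p in lexvec]
    if lemma in lexvec and part_vecs:
        return 0.5 * (lexvec[lemma] + np.sum(part_vecs, axis=0))
    if part_vecs:
        return np.sum(part_vecs, axis=0)
    return None                                    # out-of-vocabulary lemma

def synset_embedding(synset_name, lexvec):
    """Eq. (1): the synset centroid is the mean of its lemma embeddings."""
    lemmas = [l.name().lower() for l in wn.synset(synset_name).lemmas()]
    vectors = [v for v in (lemma_embedding(l, lexvec) for l in lemmas) if v is not None]
    return np.mean(vectors, axis=0) if vectors else None
```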

4.4 Image Ranking

Approximate-Triplet Image Retrieval.

After triplet retrieval, we have, for each canonical triplet \( T_{i} \), a set of approximates \( A^{T_{i}} = a_{1}, a_{2}, \cdots, a_{k} \). Each of these approximates is a triplet that is semantically similar to the query triplet. For each approximate \( a_{i} \), we have a set of images \( I_{1}, I_{2}, \cdots, I_{n} \) that contain the approximate triplet within them. We generate an inverted index of images: for each image, we collect all approximates \( a_{k} \) for each triplet \( T_{i} \) (Fig. 9). This allows us to work on an image-by-image basis by scoring each image based on its holistic similarity to the query triplet.

Fig. 9.

We store all approximates for each triplet for each image. It is possible that images may not contain a relation, in which case the index will not contain any approximate for a given query ( truck on grass ).

Inverted Index.

Each approximate is a triplet of synsets, and we obtain the synset embeddings using the method in Eq. 1. We also obtain the embeddings for the canonical representations from the embeddings lookup (LexVec, in this case). At this point, we note that each approximate triplet is independent of every other approximate triplet. We further note that for each approximate \( a_{i} \in I_{n} \), there may be multiple triplets that match the approximate. Consider an approximate triplet of the form man on grass : an image may contain multiple instances of this approximate, and each is stored in the inverted index. Each of these approximate primitives \( ap_{j} \) contains a grounding of its nodes to nodes in the parent query approximate \( a_{k} \), which itself contains a grounding to its parent triplet \( T_{i} \). As such, for each \( ap_{i \in [1,j]} \), we know what it grounds to in \( T_{i} \). We further note that a single image triplet may appear in multiple primitives \( ap_{j} \), as the parent queries could have similar triplets in multiple subgraphs.
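A minimal sketch of the inverted index and its groundings (the data layout, function name, and image id are ours, intended only to make the structure in Fig. 9 concrete):

```python
from collections import defaultdict

# inverted_index[image_id][query_triplet] -> list of (scene_triplet, grounding)
inverted_index = defaultdict(lambda: defaultdict(list))

def index_match(image_id, query_triplet, scene_triplet):
    """Record that `scene_triplet` (an approximate primitive found in the image)
    grounds to `query_triplet`; the grounding maps each retrieved synset to the
    query node it approximates."""
    grounding = dict(zip(scene_triplet, query_triplet))
    inverted_index[image_id][query_triplet].append((scene_triplet, grounding))

# e.g. index_match(2317, ("girl", "eating", "cake"),
#                  ("girl.n.01", "eat.v.01", "patty.n.01"))
```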

Subgraph Scoring.

We first ‘collapse’ the obtained triplets to their subgraphs \( s_{a} \) – we generate the image subgraphs that contain all approximate primitives within a single top-level query triplet \( T_{i} = s_{1}, s_{2}, \cdots s_{a} \). As such, we reduce all redundant copies of each unique node within the inverted index and connect the independent triplets wherever they share subjects or predicates. This is necessary for the query matching we will perform to rank each image approximate against the provided query. We want a notion of subgraph isomorphism to match our collapsed primitives to our top-level query \( T_{i} \). From Fig. 9, we note that there may be some \( T_{i} \) that do not contain approximates. For each node in \( s_{a} \), we obtain the cosine similarity between its synset embedding and its ground embedding, which is the embedding of the node in the top-level query \( T_{i} \). The cosine similarity is in \( [0,1] \), where 1 indicates the highest similarity. For each node, we instead store \( n_{score} = 1 - {\text{cosine}}\_{\text{sim}}\left( {node,query} \right) \). For each missing node in \( s_{a} \), we add a null node with a distance of 1, representing a missing node.
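The per-node distance and subgraph score can be sketched as follows, with missing (null) nodes receiving the maximum distance of 1; the embeddings are assumed to be numpy vectors obtained in the previous step.

```python
import numpy as np

def node_score(node_vec, query_vec):
    """n_score = 1 - cosine_sim(node, query); a missing node scores 1."""
    if node_vec is None or query_vec is None:
        return 1.0
    cos = float(np.dot(node_vec, query_vec) /
                (np.linalg.norm(node_vec) * np.linalg.norm(query_vec)))
    return 1.0 - cos

def subgraph_score(grounded_nodes):
    """Sum of per-node distances for a collapsed subgraph; `grounded_nodes` is a
    list of (node_embedding, query_embedding) pairs, with None for missing nodes."""
    return sum(node_score(n, q) for n, q in grounded_nodes)
```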

Subgraph Ranking.

With this representation, we can now formulate this as a minimization problem: we wish to select the subgraph with the smallest score. Since each subgraph \( s_{a} \in T_{i} \) may contain a combination of approximates, it is intractable to calculate the global minimum selection of subgraphs. We instead model this as an analogue of vertex cover. By construction, each subgraph contains a unique set of approximates for the top-level query. As such, larger subgraphs are inherently more ‘isomorphic’ to each \( T_{i} \) simply because they contain more grounded nodes.

We then sort our subgraphs by size, and within each set of subgraphs of the same size, we pick the subgraph \( s_{m} \) with the smallest score. This subgraph \( s_{m} \) covers a subset of the top-level approximates in \( T_{i} \). We remove these approximates from consideration, and again pick, from the remaining, the largest subgraphs. We repeat this until all top-level queries are satisfied for each top-level subgraph. The selected subgraphs’ scores are summed and normalized by the number of nodes to obtain the score \( S_{i} = score\left( s_{1}, s_{2}, \cdots, s_{m} \right) \) of image \( I_{n} \) for query \( T_{i} \). We repeat this for each query to obtain the scores \( S = [S_{1}, S_{2}, \cdots, S_{i}] \) for \( I_{n} \).
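A sketch of this greedy selection, under the simplifying assumptions that each subgraph is summarized by the set of top-level approximates it covers plus its accumulated score, and that node-count normalization is approximated by three nodes per covered triplet:

```python
def select_subgraphs(subgraphs):
    """Greedy selection: repeatedly take the largest remaining subgraph, breaking
    ties by smallest score, until every top-level approximate is covered.
    Each subgraph is a (covered_approximates: frozenset, score: float) pair."""
    subgraphs = list(subgraphs)                       # work on a copy
    remaining = set().union(*(cov for cov, _ in subgraphs)) if subgraphs else set()
    selected, total_score, node_count = [], 0.0, 0
    while remaining and subgraphs:
        # largest first; within a size, smallest (best) score first
        subgraphs.sort(key=lambda sg: (-len(sg[0]), sg[1]))
        covered, score = subgraphs.pop(0)
        if not (covered & remaining):
            continue                                  # adds nothing new, skip it
        selected.append((covered, score))
        total_score += score
        node_count += 3 * len(covered)                # crude proxy: 3 nodes/triplet
        remaining -= covered
    # normalized per-query score S_i (lower is better); 1.0 if nothing matched
    return selected, (total_score / node_count if node_count else 1.0)
```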

Image Scoring and Ranking.

We note that each of the scores in \( S = \left[ {S_{1} ,S_{2} , \cdots ,S_{i} } \right] \) is in \( \left[ {0,1} \right] \), with a lower score corresponding to a more similar match for the respective top-level query. It is straightforward, then, to represent these scores as an \( i \)-dimensional score vector \( \left[ {S_{1} ,S_{2} , \cdots ,S_{i} } \right] \) and measure its Euclidean distance from the origin. This gives us a score for image \( I_{n} \). We then sort the image scores to obtain the image ranking under the query, where the smallest score corresponds to the most relevant image.
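Written out, the image-level score is simply the Euclidean norm of the per-query score vector, with lower values ranked first:

$$ score\left( I_{n} \right) = \left\| \left[ S_{1}, S_{2}, \cdots, S_{i} \right] \right\|_{2} = \sqrt{\sum\nolimits_{j = 1}^{i} S_{j}^{2}} $$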

5 Conclusions and Future Work

We have presented work on image retrieval using graph-based approximate querying and ranking. Representative examples are shown in Figs. 1 and 5, as well as a comparison in Fig. 2. We note from Fig. 2 that our system returns relevant results at higher ranks than two leading search engines. We also note from Fig. 5 that even for more complex queries, our system can return results that closely match the provided query. Top-ranked results match as much of the query as possible with holistic meaning – we reduce the ‘semantic gap’ by considering relations between objects in the image in lieu of a document-based approach that eschews a focus on image relational content.

As such, the scene graph of an image is a key factor of our work. With regard to this, future work is two-fold:

  • Accurate scene graph generation: There is some work on scene graph generation in [7]. However, the authors note that the performance is subpar. Better state-of-the-art performance in automated scene graph generation from unannotated images would allow creation of image databases at scale. Our system can then be implemented on top of such a database for approximate image retrieval.

  • Query graph generation: We provide an informal comparison of our results to current search engine results in Fig. 2. However, this is not a robust comparison as search engines accept natural language input while we provide input directly as graph queries. This is due to a major input limitation in the conversion of natural language to scene graphs. Future work would focus on graph query generation from natural language that more closely matches desired human queries, using, e.g. dependency parsing or similar methods.