
1 Introduction

Given the explosion of image-based content shared digitally, there has been a corresponding research focus on developing object recognition, image recognition and its variants, and image segmentation tools and systems for understanding, identifying, storing, and querying images. The major research thrusts include automatic image captioning, image annotation (including relationship detection), and content-based image retrieval. Automatic image captioning attempts to generate natural language captions for images that represent the holistic meaning of the image. As the saying goes, ‘A picture is worth a thousand words’, and finding the right words is a non-trivial task. Image annotation focuses on object detection combined with relationship detection and grounding to identify canonical relationships within an image. Following [1], these are of the form <subject - predicate - object> . Finally, content-based image retrieval casts a wide net; we consider some recent work from [2], who focus on using automatic image annotation via scene graph grounding to retrieve images similar to a query scene graph.

1.1 Motivation

The motivation for this work comes from the need for holistic image retrieval systems. As [3] notes, text-based queries for image retrieval encode high levels of reasoning and abstraction about an image that are difficult to represent textually. In addition, text-based queries necessarily represent a semantic gap – a disconnect between human semantics and visual features. [2] demonstrates some drawbacks of current text-based image retrieval systems, namely the inability of such systems to abstract queries beyond index-based searches by recognizing the relationships between terms. We show this in Fig. 2, where common image search utilities return results that may not match the given query. We note that in the first search engine’s results, the second through fifth results feature a woman eating a cake, while in the second, none of the first three feature a girl eating a cake. Our results, however, return images that contain both features of the query where possible, and at least one otherwise (image 4 in our results).

1.2 Overview

An overview of the process is shown in Fig. 1. Given the query, we extract the canonical forms - what we deem to be the smallest query unit. In the Visual Genome database, this is a subgraph of the form <subject - predicate - object> , where each of the subject, predicate, and object is a node, with edges between them labeled with the type of relation (subject-to-predicate or predicate-to-object). We then generate approximates of the canonical forms to broaden our search. In this case, approximates of <girl - eating - cake> include <girl.n.01 - eat.v.01 - patty.n.01> and <girl.n.01 - eat.v.01 - cake.n.03> . Note that these approximates are represented as WordNet synsets, as they are used to search the Visual Genome database, whose scene graphs also represent nodes as WordNet synsets. We retrieve images and ground them to the query – a process during which we identify which portions of the query graph the image satisfies and to what degree – using phrase similarities derived from high dimensional word embeddings (we use LexVec [4] for its state-of-the-art performance on word similarity measurements). Finally, we score each image on its grounding and rank the results.

Fig. 1.

The user provides a query with two independent requests: girl - eating - cake and fork - on - plate . We generate queries that approximately match the given query using WordNet synsets ( girl.n.01 - eat.v.01 - patty.n.01, fork.n.01 - along.r.01 - plate.n.01 ). We then retrieve images that contain these subgraphs in their scene graphs and ground the subgraphs to our top-level natural language queries. We then map the retrieved synsets and the query nodes to the same word embedding space, and measure similarities to determine holistic image similarity. The image similarity scores are then used to rank the images.

Fig. 2.

We show the results from common search engines for the same query. Our results are also presented. We note in the first set of results that the first two images do not include cake - in fact, the woman is eating peas and peppers. A woman eating cake appears third. In the second set of results, a profile view of a woman eating cake appears third in the results. Our results show the first two results with a woman (approximated as girl, as our far smaller image database did not contain any images that exactly matched the query) eating cake, with a fork on the plate.

Our work partially bridges the semantic gap by leveraging the idea of a scene graph as presented in the Visual Genome dataset [1]. A key contribution of [1] is the formalization of a scene graph – each image in the Visual Genome Dataset contains a human-generated graph of nodes denoting subjects, objects, and predicates, with the edges denoting the relations between them. The graph annotates the relational content in the image. The dataset also contains region-level and image-level natural language captions, along with WordNet synsets for objects and relationships. There is a preponderance of research surrounding the scene-graph formalization – Relationship Detection [5], Dense Caption Generation [6], and Scene Graph Generation [7]. We propose leveraging existing capabilities showcased in the above works for an efficient and scalable image retrieval pipeline.

Our contributions include:

Query Approximation and Ranking.

We present a model for performing approximate search on a scene graph database. Given a query of the form <subject - predicate - object> (e.g. <girl - eating - cake> ), we generate approximate queries by searching for terms similar to the query terms and evaluating phrases for their similarity to the query. Using this, we can generate, for the given example, query approximates that include woman - eating - cake, person - eating - cake, girl - eating - patty , and woman - eating - patty . In addition, we present an algorithm for grounding and evaluating retrieved scene graphs against the query graph for holistic image ranking.

Aggregate Graph Representations.

We present several graph database representations that are used in generating plausible query approximates by reducing the search space to the current semantic context. Specifically, we show how to create three aggregate graph representations that each index a different query node type: (i) a Subject Aggregate Graph (SAG) for obtaining subject approximates, (ii) an Object Aggregate Graph (OAG) for obtaining object approximates, and (iii) a Predicate Aggregate Graph (PAG) for obtaining predicate approximates, all in the current semantic context.

1.3 The Image Retrieval Pipeline

Figures 1 and 3 summarize our approach. We now focus on the pipeline in Fig. 3. Given a natural language graph query (e.g. a small scene graph as considered by [2]), we reduce the query to its canonical forms (Sect. 4.1); we define a canonical form as the basic unit of query consisting of a subject, predicate, and object of the form <subject - predicate - object> . We then generate approximate queries using WordNet synsets; we retrieve the set of candidate images that contain these approximates and ground the images’ scene graphs to the query. We then use word embeddings to project each canonical form into the vector space and score each image’s scene graph against the query graph, penalizing images with missing or inexact matches.

Fig. 3.

The user provides a natural language graph query. We convert this to the canonical form (the basic query unit of the form <subject - predicate - object> ) for each top-level query and generate the approximates. Plausible subject, object, and predicate approximates are generated using the aggregate graphs we have devised. These aggregate graphs – the Subject Aggregate Graph, the Object Aggregate Graph, and the Predicate Aggregate Graph – allow us to reduce the search space of subjects, objects, and predicates, respectively. Using the node approximates, we generate query approximates and retrieve the set of candidate images using an inverted index of the canonical queries from the Visual Genome dataset. We then ground the images to determine how well each image represents our natural language query and rank them accordingly.

Three sample queries of varying complexity are presented in Fig. 4. Our approach handles both simple and complex queries. Our grounding ensures images that most closely match the entire query are ranked higher than images that offer a partial match. In addition, we bridge the ‘semantic gap’ by searching for approximates. This allows us to return images that closely match the provided query in situations where an exact match may not be possible; an example is shown in Fig. 5: there is no image that exactly matches the query; however, our approach approximates woman into girl, and finds images that contain most of the query: a girl eating cake, with the cake on a plate, and a fork on the plate as well. While the highest ranked image in the figure does contain cake with frosting, this is a happy coincidence; the scene graph itself does not contain the annotation, so we say the image mostly matches the query. We also show the second through fifth ranked images and their matched scene subgraphs.

Fig. 4.

(a) A simple query. This is also the canonical form described in Sect. 4.1. (b) A more complex query. The same noun (cake) is the object of two separate subjects (woman and frosting), each connected by a different predicate . (c) Two independent top-level queries for the same image, i.e. they do not share any nodes between them. This consists of a ‘simple’ top-level query and a ‘complex’ top-level query, as per the informal terms adopted in (a) and (b).

Fig. 5.

We show our system matching a complex query and returning images that most closely match all facets of the query.

2 Related Work

Content-Based Image Retrieval.

Content-based image retrieval involves retrieving and ranking images given either a text or image query. The former can be used to retrieve images that incorporate some of the query requirements, while the latter can be used to retrieve images relevant to or similar to the query – here evaluation is context-specific, e.g. whether images have similar objects, similar colors, similar poses, or whether the images are exactly the same. Some implementations include Google Reverse Image Search and TinEye. Our focus is on the former approach, as we feel text-based queries allow users more freedom in specifying images to retrieve. It is not always the case that a user has a sample of the image she would like, or can sketch a faithful representation of it, as in, e.g., sketch-based image retrieval. [2]’s key contribution is as follows: the authors develop a framework for generating accurate groundings of a query scene graph (either complete or partial) and use grounding likelihood to rank images for retrieval and display.

Semantic Similarity.

An integral aspect of our image retrieval pipeline is approximate query generation. We work with canonical relationship phrases of the form <subject - predicate - object> to generate query approximates. As the Visual Genome dataset maps each object to available WordNet synsets, we incorporate WordNet based semantic similarity measures. [8] proposes a domain-specific corpus-based training method to identify correct word sense and derives more accurate cosine-similarity measures between source and target words. [9] shows a simple baseline for WordNet synset similarity using vector embedding cosine similarity averages. [10] shows a sentence-based similarity measure that uses a TF-IDF analogue to compute similarities between a source word and its synset lemmas.

3 Graph Databases

As noted, [1] formalizes the scene graph – an image representation using human annotations on bounded regions that grounds objects, relationships, and attributes. Each object (usually a noun) is considered a node with a directed edge towards a relation (a predicate with a part-of-speech tag of verb, preposition, or action). The predicate may or may not have a directed edge towards another object instance. In addition, objects also have attributes (usually adjectives, but these may also include actions). We will henceforth consider each object node as a noun with two forms: subject or object . Note that object the graph node and object the grammatical role operate in separate domains; despite the potential for notational confusion, convention suggests these terms as the appropriate choice. A canonical form is a set of three nodes and two edges of the form <subject → predicate → object> : this triplet indicates a base query that we use as the smallest query unit (i.e. the canonical form).

3.1 Full Scene Graphs

We generate the full scene graph for each image using the scene_graphs from the Visual Genome dataset. The full scene graph is stored in a Neo4J database. Each full scene graph consists of at least one subgraph with at least one canonical triplet of the form <subject - predicate - object>. We note that there may be multiple independent subgraphs corresponding to different regions in the image.
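As an illustration of this loading step, the sketch below inserts one canonical triplet per annotated relationship into Neo4j using the official Python driver. The connection settings, node labels, relationship type names (SUBJ_TO_PRED, PRED_TO_OBJ), and the exact JSON field names are assumptions made for illustration based on the distributed scene_graphs data, not the authors' implementation.

```python
import json
from neo4j import GraphDatabase

# Placeholder connection details; adjust to the local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOAD_TRIPLET = """
MERGE (s:Entity    {synset: $subj, image_id: $img})
MERGE (p:Predicate {synset: $pred, image_id: $img})
MERGE (o:Entity    {synset: $obj,  image_id: $img})
MERGE (s)-[:SUBJ_TO_PRED]->(p)
MERGE (p)-[:PRED_TO_OBJ]->(o)
"""

def load_scene_graphs(path="scene_graphs.json"):
    """Insert one <subject - predicate - object> triplet per annotated relationship."""
    with open(path) as f:
        scene_graphs = json.load(f)
    with driver.session() as session:
        for sg in scene_graphs:
            objects = {o["object_id"]: o for o in sg["objects"]}
            for rel in sg["relationships"]:
                subj = (objects[rel["subject_id"]].get("synsets") or [None])[0]
                obj = (objects[rel["object_id"]].get("synsets") or [None])[0]
                pred = (rel.get("synsets") or [None])[0]
                if subj and pred and obj:
                    session.run(LOAD_TRIPLET, subj=subj, pred=pred, obj=obj,
                                img=sg["image_id"])
```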

3.2 Object Aggregate Graph

Our query approximation requires us to generate the candidate subjects, objects, and predicates for each top-level query. We develop a novel characterization of the aggregate graph introduced in [1] – we maintain a unique index of <subject → predicate> pairs, and for each pair, we maintain a unique list of objects that are associated with that subject → predicate pair. Objects are not unique across subgraphs in the aggregate graph: apple.n.01 may appear in multiple <subject → predicate> subgraphs; however, within a subgraph, each WordNet synset occurs once.

With the Object Aggregate Graph, we can, given a subject → predicate pair as well as several candidate objects , reduce the set of candidate objects to plausible candidates for that subject → predicate pair. Given candidate objects \( C_{O} \) and child objects \( K_{O} \) in the Object Aggregate Graph, we return \( C_{O} \cap K_{O} \). Figure 6 shows a subgraph within our Object Aggregate Graph.
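A minimal in-memory sketch of the Object Aggregate Graph and the reduction step follows; the class and method names are hypothetical, and the same pattern (with the key and value roles swapped) yields the Subject and Predicate Aggregate Graphs of Sects. 3.3 and 3.4.

```python
from collections import defaultdict

class ObjectAggregateGraph:
    """For every <subject -> predicate> pair seen in the database, keep the
    unique set of objects that ever appear under it."""
    def __init__(self):
        self.children = defaultdict(set)   # (subject, predicate) -> {objects}

    def add_triplet(self, subj, pred, obj):
        self.children[(subj, pred)].add(obj)

    def plausible_objects(self, subj, pred, candidate_objects):
        """Keep only the candidate objects actually observed under this pair."""
        return set(candidate_objects) & self.children.get((subj, pred), set())

# Usage with the example from Fig. 6 (synset names as in Visual Genome):
oag = ObjectAggregateGraph()
oag.add_triplet("fire_engine.n.01", "along.r.01", "street.n.01")
oag.add_triplet("fire_engine.n.01", "along.r.01", "sidewalk.n.01")
print(oag.plausible_objects("fire_engine.n.01", "along.r.01",
                            {"street.n.01", "road.n.01"}))
# -> {'street.n.01'}
```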

Fig. 6.

This subgraph from the Object Aggregate Graph shows all possible objects that a Fire Engine ( fire_engine.n.01 ) can be next to ( along.r.01 ). These include streets ( street.n.01 ), sidewalks ( sidewalk.n.01 ), and highways ( highway.n.01 ). Given a set of approximate synsets for street, as well as the fire_engine.n.01 → along.r.01 pair, we can identify the appropriate set of plausible objects using the process from Sect. 3.2.

3.3 Subject Aggregate Graph

To generate candidate subjects, we devise a Subject Aggregate Graph: we maintain a set of unique predicate → object pairs, and for each pair, we keep a unique set of subjects that are associated with that pair. So, given a predicate → object pair and candidate subjects \( C_{S} \), we reduce them to the plausible set of subjects by returning \( C_{S} \cap K_{S} \) (where \( K_{S} \) is the set of subjects associated with that pair in the Subject Aggregate Graph). We show this in Fig. 7.

Fig. 7.

This subgraph from the Subject Aggregate Graph shows all possible subjects that can eat ( eat.v.01 ) an apple ( apple.n.01 ). These include woman ( woman.n.01 ), child ( child.n.01 ), and girl ( girl.n.01 ).

3.4 Predicate Aggregate Graph

We adapt [1]’s aggregate graph here; however, we include all nouns and predicates from the scene graphs, instead of the top-k nouns and predicates. Given a set of candidate subjects and candidate objects , we can obtain the set of plausible predicates, and return the union of the candidate and plausible predicates. We show a subgraph in Fig. 8.

Fig. 8.

This partial predicate aggregate graph shows all possible nodes that are subjects and objects for the predicate dance ( dance.v.01 ). This allows us to determine plausible predicates given a subject-object pair.

4 Query Matching and Image Retrieval

4.1 Canonical Form

The canonical form of a query is a triplet (Fig. 4a) of the form <subject → predicate → object> . Given a graph query, we extract canonical forms from the query and operate on each independently, within scope of the query subgraph. Once again, we refer to Fig. 4c – this query consists of two independent subgraphs: one about a girl by a man, with the girl wearing a skirt and one about a truck on grass . We split the first query into <girl → wears → skirt> and <girl → by → man> . Similarly, we extract from truck on grass the triplet <truck → on → grass> .

We perform this triplet extraction under an independence assumption – that we can extract individual triplets, then combine them later to obtain the final rankings. This allows us to operate independently on each triplet and its set of query approximates.
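The sketch below illustrates this decomposition on an assumed graph-query representation (a dict of predicate nodes plus labeled edges); the format is ours, chosen only to make the extraction step concrete.

```python
from typing import Dict, List, NamedTuple

class Triplet(NamedTuple):
    subject: str
    predicate: str
    object: str

def extract_canonical_forms(query_graph: Dict) -> List[Triplet]:
    """Walk every predicate node of a query graph and emit one canonical
    <subject -> predicate -> object> triplet per predicate."""
    subj_of, obj_of = {}, {}            # predicate node id -> attached noun label
    for edge in query_graph["edges"]:
        if edge["type"] == "subject_to_predicate":
            subj_of[edge["predicate"]] = edge["node"]
        elif edge["type"] == "predicate_to_object":
            obj_of[edge["predicate"]] = edge["node"]
    return [Triplet(subj_of[p], label, obj_of[p])
            for p, label in query_graph["predicates"].items()
            if p in subj_of and p in obj_of]

# Example: "girl wears skirt, girl by man" decomposes into two triplets.
query = {
    "predicates": {0: "wears", 1: "by"},
    "edges": [
        {"type": "subject_to_predicate", "node": "girl",  "predicate": 0},
        {"type": "predicate_to_object",  "node": "skirt", "predicate": 0},
        {"type": "subject_to_predicate", "node": "girl",  "predicate": 1},
        {"type": "predicate_to_object",  "node": "man",   "predicate": 1},
    ],
}
print(extract_canonical_forms(query))
# [Triplet(subject='girl', predicate='wears', object='skirt'),
#  Triplet(subject='girl', predicate='by', object='man')]
```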

4.2 Approximate Generation and Retrieval

Given a triplet, we obtain the WordNet approximates for the subject, predicate, and object. There are three levels of approximates – obtaining the sister synsets, the child synsets (hyponymy), and the parent synsets (hypernymy). The sister synsets are obtained by performing a lookup in the WordNet database of our natural language node label (i.e. ‘girl’ or ‘skirt’ or ‘wears’). There are four scopes available for synset lookup: we can take (i) just the sister synsets, (ii) the sister and child synsets, (iii) the sister and parent synsets, or (iv) all three hierarchies: sister, child, and parent synsets. We limit our choice for candidate subjects and objects to the sister synsets and for candidate predicates to the sister and child synsets. This is to speed up computation time on our local machines; on parallelized clusters, such a limitation is not necessary, and we can use the complete closure of a synset: sisters, children, and parents.
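Using NLTK's WordNet interface, these lookup scopes can be sketched as follows (the helper name and flags are ours):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def node_approximates(label, pos, include_children=False, include_parents=False):
    """Sister synsets for a node label, optionally widened with hyponyms
    (children) and hypernyms (parents), mirroring the four scopes above."""
    sisters = wn.synsets(label, pos=pos)
    approx = set(sisters)
    if include_children:
        approx.update(h for s in sisters for h in s.hyponyms())
    if include_parents:
        approx.update(h for s in sisters for h in s.hypernyms())
    return approx

# Subjects/objects: sisters only; predicates: sisters + children (as in the text).
subject_approx = node_approximates("girl", wn.NOUN)
predicate_approx = node_approximates("eat", wn.VERB, include_children=True)
```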

Given the set of approximates for the subject and object, we obtain plausible predicates from the aggregate graph. The predicates are ranked against the provided predicate sister synsets: we measure the Wu & Palmer (WUP) similarity between each plausible predicate and the set of sister predicates, and take the average similarity score. Here, we take the top 2/3 of predicates as our set of predicate approximates. We choose WUP similarity as it weights synset edges by distance in the hierarchy, i.e. semantically similar predicates are ranked higher than synonymous predicates without a semantic relation. This allows us to narrow results to more appropriate relations by context.
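A sketch of this ranking step, assuming NLTK's wup_similarity and synset-name strings as used in the Visual Genome annotations (the 2/3 cut-off follows the text; the function name is ours):

```python
from nltk.corpus import wordnet as wn

def rank_plausible_predicates(plausible, sister_predicates, keep_fraction=2 / 3):
    """Score each plausible predicate (from the Predicate Aggregate Graph) by its
    average WUP similarity to the query's sister predicate synsets, and keep the
    top fraction. Both arguments are lists of synset names such as 'eat.v.01'."""
    def avg_wup(candidate):
        cand = wn.synset(candidate)
        scores = [cand.wup_similarity(wn.synset(s)) or 0.0 for s in sister_predicates]
        return sum(scores) / len(scores) if scores else 0.0

    ranked = sorted(plausible, key=avg_wup, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

# e.g. rank_plausible_predicates(["eat.v.01", "devour.v.01", "hold.v.01"],
#                                ["eat.v.01", "eat.v.02"])
```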

4.3 Synset Embeddings

For ranking, we prefer to use a Euclidean metric for measuring triplet distance between the query and approximates. However, the WordNet hierarchy does not operate on such a metric. We use a model similar to [9] – we determine the embeddings for each synset in our database following the baseline method of [9], but instead of the sum, we take the average of the lemma embeddings to obtain the synset centroid in the embedding space with

$$ v_{S} = \frac{1}{\left| L_{S} \right|}\sum\nolimits_{l \in L_{S}} v_{l} $$
(1)

where \( v_{S} \) is the embedding vector for a synset \( S \), \( L_{S} \) is the set of lemmas for the synset \( S \), and \( v_{l} \) is the embedding for each lemma \( l \in L_{S} \). We use LexVec [4] embeddings as the lookup table for \( v_{l} \), as LexVec has shown state-of-the-art performance in word similarity and analogy. We deal with compound words by averaging the embedding for the compound word itself and the embedding sum of its component words: given a compound word \( w = w_{1} w_{2} \cdots w_{n} \) (i.e. for \( w = \) christmas tree , \( w_{1} = \) christmas and \( w_{2} = \) tree ), we take \( v_{l}^{\prime} = \frac{1}{2}\left( v_{l} + \sum\nolimits_{i = 1}^{n} v_{l_{i}} \right) \). If the compound word itself does not exist in the lookup table, we take only the embedding sum of its component words.
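The following sketch implements Eq. (1) and the compound-word rule above; the embedding lookup is assumed to be a dict-like mapping from word to vector (e.g. loaded LexVec vectors), and the helper names are ours.

```python
import numpy as np
from nltk.corpus import wordnet as wn

def lemma_embedding(lemma, lexvec):
    """Compound-word rule: average the compound's own vector (if present) with
    the sum of its parts; otherwise fall back to the sum of the parts."""
    parts = lemma.split("_")                       # WordNet joins compounds with '_'
    part_vecs = [lexvec[p] for p in parts if p in lexvec]
    if lemma in lexvec and part_vecs:
        return 0.5 * (lexvec[lemma] + np.sum(part_vecs, axis=0))
    if part_vecs:
        return np.sum(part_vecs, axis=0)
    return None                                    # out-of-vocabulary lemma

def synset_embedding(synset_name, lexvec):
    """Eq. (1): the synset centroid is the mean of its lemma embeddings."""
    lemmas = [l.name().lower() for l in wn.synset(synset_name).lemmas()]
    vectors = [v for v in (lemma_embedding(l, lexvec) for l in lemmas) if v is not None]
    return np.mean(vectors, axis=0) if vectors else None
```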

4.4 Image Ranking

Approximate-Triplet Image Retrieval.

After triplet retrieval, we have, for each canonical triplet \( T_{i} \), a set of approximates \( A^{T_{i}} = a_{1}, a_{2}, \cdots, a_{k} \). Each of these approximates is a triplet that is semantically similar to the query triplet. For each approximate \( a_{i} \), we have a set of images \( I_{1}, I_{2}, \cdots, I_{n} \) that contain the approximate triplet within them. We generate an inverted index of images: for each image, we collect all approximates \( a_{k} \) for each triplet \( T_{i} \) (Fig. 9). This allows us to work on an image-by-image basis by scoring each image based on its holistic similarity to the query triplet.

Fig. 9.

We store all approximates for each triplet for each image. It is possible that images may not contain a relation, in which case the index will not contain any approximate for a given query ( truck on grass ).

Inverted Index.

Each approximate is a triplet of synsets, and we obtain the synset embeddings using the method in Eq. 1. We also obtain the embeddings for the canonical representations from the embeddings lookup (LexVec, in this case). At this point, we note that each approximate triplet is independent of every other approximate triplet. We further note that for each approximate \( a_{i} \in I_{n} \), there may be multiple triplets that match the approximate. Consider an approximate triplet of the form man on grass : an image may contain multiple instances of this approximate, and each is stored in the inverted index. Each of these approximate primitives \( ap_{j} \) contains a grounding of its nodes to nodes in the parent query approximate \( a_{k} \), which itself contains a grounding to its parent triplet \( T_{i} \). As such, for each \( ap_{i \in [1,j]} \), we know what it grounds to in \( T_{i} \). We further note that a single image triplet may appear in multiple primitives \( ap_{j} \), as the parent queries could have similar triplets in multiple subgraphs.
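A minimal sketch of the inverted index and its groundings (the data layout, function name, and image id are ours, intended only to make the structure in Fig. 9 concrete):

```python
from collections import defaultdict

# inverted_index[image_id][query_triplet] -> list of (scene_triplet, grounding)
inverted_index = defaultdict(lambda: defaultdict(list))

def index_match(image_id, query_triplet, scene_triplet):
    """Record that `scene_triplet` (an approximate primitive found in the image)
    grounds to `query_triplet`; the grounding maps each retrieved synset to the
    query node it approximates."""
    grounding = dict(zip(scene_triplet, query_triplet))
    inverted_index[image_id][query_triplet].append((scene_triplet, grounding))

# e.g. index_match(2317, ("girl", "eating", "cake"),
#                  ("girl.n.01", "eat.v.01", "patty.n.01"))
```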

Subgraph Scoring.

We first ‘collapse’ the obtained triplets to their subgraphs \( s_{a} \) – we generate the image subgraphs that contain all approximate primitives within a single top-level query triplet \( T_{i} = s_{1}, s_{2}, \cdots s_{a} \). As such, we reduce all redundant copies of each unique node within the inverted index and connect the independent triplets wherever they share subjects or predicates. This is necessary for the query matching we will perform to rank each image approximate against the provided query. We want a notion of subgraph isomorphism to match our collapsed primitives to our top-level query \( T_{i} \). From Fig. 9, we note that there may be some \( T_{i} \) that do not contain approximates. For each node in \( s_{a} \), we obtain the cosine similarity between its synset embedding and its ground embedding, which is the embedding of the node in the top-level query \( T_{i} \). The cosine similarity is in \( [0,1] \), where 1 indicates the highest similarity. For each node, we instead store \( n_{score} = 1 - {\text{cosine}}\_{\text{sim}}\left( {node,query} \right) \). For each missing node in \( s_{a} \), we add a null node with a distance of 1, representing a missing node.
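The per-node distance and subgraph score can be sketched as follows, with missing (null) nodes receiving the maximum distance of 1; the embeddings are assumed to be numpy vectors obtained in the previous step.

```python
import numpy as np

def node_score(node_vec, query_vec):
    """n_score = 1 - cosine_sim(node, query); a missing node scores 1."""
    if node_vec is None or query_vec is None:
        return 1.0
    cos = float(np.dot(node_vec, query_vec) /
                (np.linalg.norm(node_vec) * np.linalg.norm(query_vec)))
    return 1.0 - cos

def subgraph_score(grounded_nodes):
    """Sum of per-node distances for a collapsed subgraph; `grounded_nodes` is a
    list of (node_embedding, query_embedding) pairs, with None for missing nodes."""
    return sum(node_score(n, q) for n, q in grounded_nodes)
```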

Subgraph Ranking.

With this representation, we can now formulate this as a minimization problem: we wish to select the subgraph with the smallest score. Since each subgraph \( s_{a} \in T_{i} \) may contain a combination of approximates, it is intractable to calculate the global minimum selection of subgraphs. We instead model this as an analogue of vertex cover. By construction, each subgraph contains a unique set of approximates for the top-level query. As such, larger subgraphs are inherently more ‘isomorphic’ to each \( T_{i} \) simply because they contain more grounded nodes.

We then sort our subgraphs by size, and within each set of subgraphs of the same size, we pick the subgraph \( s_{m} \) with the smallest score. This subgraph \( s_{m} \) covers a subset of the top-level approximates in \( T_{i} \). We remove these approximates from consideration, and again pick, from the remaining, the largest subgraphs. We repeat this until all top-level queries are satisfied for each top-level subgraph. The selected subgraphs’ scores are summed and normalized by the number of nodes to obtain the score \( S_{i} = score\left( s_{1}, s_{2}, \cdots, s_{m} \right) \) of image \( I_{n} \) for query \( T_{i} \). We repeat this for each query to obtain the scores \( S = [S_{1}, S_{2}, \cdots, S_{i}] \) for \( I_{n} \).
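A sketch of this greedy selection, under the simplifying assumptions that each subgraph is summarized by the set of top-level approximates it covers plus its accumulated score, and that node-count normalization is approximated by three nodes per covered triplet:

```python
def select_subgraphs(subgraphs):
    """Greedy selection: repeatedly take the largest remaining subgraph, breaking
    ties by smallest score, until every top-level approximate is covered.
    Each subgraph is a (covered_approximates: frozenset, score: float) pair."""
    subgraphs = list(subgraphs)                       # work on a copy
    remaining = set().union(*(cov for cov, _ in subgraphs)) if subgraphs else set()
    selected, total_score, node_count = [], 0.0, 0
    while remaining and subgraphs:
        # largest first; within a size, smallest (best) score first
        subgraphs.sort(key=lambda sg: (-len(sg[0]), sg[1]))
        covered, score = subgraphs.pop(0)
        if not (covered & remaining):
            continue                                  # adds nothing new, skip it
        selected.append((covered, score))
        total_score += score
        node_count += 3 * len(covered)                # crude proxy: 3 nodes/triplet
        remaining -= covered
    # normalized per-query score S_i (lower is better); 1.0 if nothing matched
    return selected, (total_score / node_count if node_count else 1.0)
```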

Image Scoring and Ranking.

We note that each of the scores in \( S = \left[ {S_{1} ,S_{2} , \cdots ,S_{i} } \right] \) is in \( \left[ {0,1} \right] \), with a lower score corresponding to a more similar match for the respective top-level query. It is straightforward, then, to represent these scores as an \( i \)-dimensional score vector \( \left[ {S_{1} ,S_{2} , \cdots ,S_{i} } \right] \) and measure its Euclidean distance from the origin. This gives us a score for image \( I_{n} \). We then sort the image scores to obtain the image ranking under the query, where the smallest score corresponds to the most relevant image.
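Written out, the image-level score is simply the Euclidean norm of the per-query score vector, with lower values ranked first:

$$ score\left( I_{n} \right) = \left\| \left[ S_{1}, S_{2}, \cdots, S_{i} \right] \right\|_{2} = \sqrt{\sum\nolimits_{j = 1}^{i} S_{j}^{2}} $$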

5 Conclusions and Future Work

We have presented work on image retrieval using graph-based approximate querying and ranking. Representative examples are shown in Figs. 1 and 5, as well as a comparison in Fig. 2. We note from Fig. 2 that our system returns relevant results at higher ranks than two leading search engines. We also note from Fig. 5 that even for more complex queries, our system can return results that closely match the provided query. Top-ranked results match as much of the query as possible with holistic meaning – we reduce the ‘semantic gap’ by considering relations between objects in the image in lieu of a document-based approach that eschews a focus on image relational content.

As such, the scene graph of an image is a key factor of our work. With regard to this, future work is two-fold:

  • Accurate scene graph generation: There is some work on scene graph generation in [7]. However, the authors note that the performance is subpar. Better state-of-the-art performance in automated scene graph generation from unannotated images would allow creation of image databases at scale. Our system can then be implemented on top of such a database for approximate image retrieval.

  • Query graph generation: We provide an informal comparison of our results to current search engine results in Fig. 2. However, this is not a robust comparison as search engines accept natural language input while we provide input directly as graph queries. This is due to a major input limitation in the conversion of natural language to scene graphs. Future work would focus on graph query generation from natural language that more closely matches desired human queries, using, e.g. dependency parsing or similar methods.