1 Introduction

Extracting structured, symbolic, semantic representations from data has a long history in Natural Language Processing (NLP), under the umbrella terms semantic parsing at the sentence level [8, 9] and information extraction at the document level [22, 41]. The resulting semantic graphs or knowledge graphs have many applications such as question answering [7, 17] and information retrieval [6, 50]. In computer vision, Xu et al. have recently called attention to the task of Scene Graph Generation (SGG) [44], which aims at extracting a symbolic, graphical representation from a given image, where every node corresponds to a localized and categorized object (entity), and every edge encodes a pairwise interaction (predicate). This has inspired two lines of follow-up work, some improving the performance of SGG [2, 10, 11, 23, 24, 31, 43, 47, 52], and others exploiting such rich structures for downstream tasks such as Visual Question Answering (VQA) [12, 38, 39, 54], image captioning [48, 49], image retrieval [15, 37], and image synthesis [14]. In VQA for instance, SGG not only improves performance, but also promotes interpretability and enables explainable reasoning [38].

Fig. 1. Left: An example of a Visual Genome image and its ground truth scene graph. Right: A relevant portion of the commonsense graph. In this paper we formulate the task of Scene Graph Generation as the problem of creating a bridge between these two graphs. Such a bridge not only classifies each scene entity and predicate, but also creates an interconnected heterogeneous graph whose rich structure is exploited by our method (GB-Net).

Although several methods have been proposed, state-of-the-art performance for SGG is still far from acceptable. Most recently, [2] achieves only 16% mean recall when matching the top 100 predicted subject-predicate-object triples against ground truth triples. This suggests that current SGG methods are insufficient to address the complexity of this task. Recently, a few papers have attempted to use external commonsense knowledge to advance SGG [2, 10, 52], as well as other domains [3, 16]. This commonsense can range from curated knowledge bases such as ConceptNet [27] and ontologies such as WordNet [30] to automatically extracted facts such as co-occurrence frequencies [52]. The key message of those works is that prior knowledge about the world can be very helpful when perceiving a complex scene. If we know the relationship of a Person and a Bike is most likely riding, we can more easily disambiguate between riding, on, and attachedTo, and classify their relationship more accurately. Similarly, if we know a Man and a Woman are both sub-types of Person, even if we only see Man-riding-Bike in training data, we can generalize and recognize a Woman-riding-Bike triplet at test time. Although this idea is intuitively promising, existing methods that implement it have major limitations, as detailed in Sect. 2, and we address those in the proposed method.

More specifically, recent methods either use ad-hoc heuristics to integrate limited types of commonsense into the scene graph generation process [2, 52], or fail to exploit the rich, graphical structure of commonsense knowledge [10]. To devise a general framework for incorporating any type of graphical knowledge into the process of scene understanding, we take inspiration from early works on knowledge representation and on applying structured grammars to computer vision problems [32, 40, 55], and redefine those concepts in light of recent advances in graph-based deep learning. Simply put, we formulate both scene and commonsense graphs as knowledge graphs with entity and predicate nodes, and various types of edges. A scene graph node represents an entity or predicate instance in a specific image, while a commonsense graph node represents an entity or predicate class, which is a general concept independent of the image. Similarly, a scene graph edge indicates the participation of an entity instance (e.g. as a subject or object) in a predicate instance in a scene, while a commonsense edge states a general fact about the interaction of two concepts in the world. Figure 1 shows an example scene graph and commonsense graph side by side.

Based on this unified perspective, we reformulate the problem of scene graph generation from entity and predicate classification into the novel problem of bridging those two graphs. More specifically, we propose a method that, given an image, initializes potential entity and predicate nodes, and then classifies each node by connecting it to its corresponding class node in the commonsense graph, through an edge we call a bridge. This establishes a connectivity between instance-level, visual knowledge and generic, commonsense knowledge. To incorporate the rich combination of visual and commonsense information in the SGG process, we propose a novel graphical neural network that iteratively propagates messages between the scene and commonsense graphs, as well as within each of them, while gradually refining the bridge in each iteration. Our Graph Bridging Network, GB-Net, successively infers edges and nodes, allowing it to simultaneously exploit and refine the rich, heterogeneous structure of the interconnected scene and commonsense graphs.

To evaluate the effectiveness of our method, we conduct extensive experiments on the Visual Genome [20] dataset. The proposed GB-Net outperforms the state of the art consistently across various performance metrics. Through ablative studies, we show how each of the proposed ideas contributes to the results. We also publicly release a comprehensive software package based on [52] and [2], to reproduce all the numbers reported in this paper. We provide further quantitative, qualitative, and speed analysis in our Supplementary Material, as well as additional implementation details.

2 Related Work

2.1 Scene Graph Generation

Most SGG methods are based on an object detection backbone that extracts region proposals from the input image. They utilize some kind of information propagation module to incorporate context, and then classify each region into an object class, as well as each pair of regions into a relation class [2, 44, 47, 52]. Our method has two key differences from this conventional process: first, our information propagation network operates on a larger graph which consists of not only object nodes, but also predicate nodes and commonsense graph nodes, and has a more complex structure. Second, we do not classify each object and relation using dedicated classifiers, but instead use a pairwise matching mechanism to connect them to corresponding class nodes in the commonsense graph.

More recently, a few methods [2, 10, 52] have used external knowledge to enhance scene graph generation. This external knowledge is sometimes referred to as “commonsense”, because it encodes ontological knowledge about classes, rather than specific instances. Despite encouraging results, these methods have major limitations. Specifically, [52] used triplet frequency to bias the logits of their predicate classifier, and [2] used such frequencies to initialize edge weights on their graphs. Such external priors have also been shown beneficial for recognizing objects [45, 46] and relationships [26, 53], which are the building blocks of SGG. Nevertheless, neither of those methods can incorporate other types of knowledge, such as semantic hierarchies of concepts or object affordances. Gu et al. [10] propose a more general way to incorporate knowledge in SGG, by retrieving a set of relevant facts for each object from a pool of commonsense facts. However, their method does not utilize the structure of the commonsense graph, and treats knowledge as a set of triplets. Our method considers commonsense as a general graph with several types of edges, explicitly integrates that graph with the scene graph by connecting corresponding nodes, and incorporates the rich structure of commonsense by graphical message passing.

2.2 Graph-Based Neural Networks

By Graph-based Neural Networks (GNN), we refer to the family of models that take a graph as input, and iteratively update the representation of each node by applying a learnable function (i.e., a message) to the node’s neighbors. Graph Convolutional Networks (GCN) [19], Gated Graph Neural Networks (GGNN) [25], and others are all specific implementations of this general model. Most SGG methods use some variant of GNNs to propagate information between region proposals [2, 24, 44, 47]. Our message passing method, detailed in Sect. 4, resembles GGNN, but instead of propagating messages through a static graph, we update (some) edges as well. A few methods exist that dynamically update edges during message passing [35, 51], but we are the first to refine edges between a scene graph and an external knowledge graph.

Apart from SGG, GNNs have been used in several other computer vision tasks, often in order to propagate context information across different objects in a scene. For instance, [28] injects a GNN into a Faster R-CNN [36] framework to contextualize the features of region proposals before classifying them. This improves the results, since the presence of a table can affect the detection of a chair. On the other hand, some methods utilize GNNs on graphs that represent the ontology of concepts, rather than objects in a scene [16, 21, 29, 42]. This often enables generalization to unseen or infrequent concepts by incorporating their relationship with frequently seen concepts. Most closely related to our work, Chen et al. [3] were the first to bring those two ideas together, forming a graph from the objects in an image as well as the object classes in the ontology. Nevertheless, the class nodes in that work were merely an auxiliary means to improve object features before classification. In contrast, we classify the nodes by explicitly inferring their connection to their corresponding class nodes. Moreover, we iteratively refine the bridge between scene and commonsense graphs to enhance our prediction. Furthermore, their task only involves objects and object classes, while we explore a more complex structure where predicates play an important role as well.

3 Problem Formulation

In this section, we first formalize the concepts of knowledge graph in general, and commonsense graph and scene graph in particular. Leveraging their similarities, we then reformulate the problem of scene graph generation as bridging these two graphs.

3.1 Knowledge Graphs

We define a knowledge graph as a set of entity and predicate nodes \((\mathcal {N}_\text {E},\mathcal {N}_\text {P})\), each with a semantic label, and a set of directed, weighted edges \(\mathcal {E}\) from a predefined set of types. Denoting by \(\varDelta \) a node type (here, either entity E or predicate P), the set of edges encoding the relation r between nodes of type \(\varDelta \) and \(\varDelta '\) is defined as

$$\begin{aligned} \begin{aligned} \mathcal {E}^{\varDelta \rightarrow \varDelta '}_r \subseteq \mathcal {N}_\varDelta \times \mathcal {N}_{\varDelta '} \rightarrow \mathbb {R}. \end{aligned} \end{aligned}$$
(1)

A commonsense graph is a type of knowledge graph in which each node represents the general concept of its semantic label, and hence each semantic label (entity or predicate class) appears in exactly one node. In such a graph, each edge encodes a relational fact involving a pair of concepts, such as Hand-partOf-Person and Cup-usedFor-Drinking. Formally, we define the set of commonsense entity (CE) nodes \(\mathcal {N}_\text {CE}\) and commonsense predicate (CP) nodes \(\mathcal {N}_\text {CP}\) as all entity and predicate classes in our task. Commonsense edges \(\mathcal {E}_\text {C}\) consist of 4 distinct subsets, depending on the source and destination node type:

$$\begin{aligned} \begin{aligned} \mathcal {E}_\text {C} =&\{\mathcal {E}^{\text {CE}\rightarrow \text {CP}}_r\} \cup \{\mathcal {E}^{\text {CP}\rightarrow \text {CE}}_r\} \, \cup \\ {}&\{\mathcal {E}^{\text {CE}\rightarrow \text {CE}}_r\} \cup \{\mathcal {E}^{\text {CP}\rightarrow \text {CP}}_r\}. \end{aligned} \end{aligned}$$
(2)

A scene graph is a different type of knowledge graph where: (a) each scene entity (SE) node is associated with a bounding box, referring to an image region, (b) each scene predicate (SP) node is associated with an ordered pair of SE nodes, namely a subject and an object, and (c) there are two types of undirected edges which connect each SP to its corresponding subject and object respectively. Because we define knowledge edges to be directed, we model each undirected subject or object edge as two directed edges in opposite directions, each with a distinct type. More specifically,

$$\begin{aligned} \begin{aligned} \mathcal {N}_\text {SE} \subseteq&[0,1]^4 \times \mathcal {N}_\text {CE} ,\\ \mathcal {N}_\text {SP} \subseteq&\mathcal {N}_\text {SE} \times \mathcal {N}_\text {SE} \times \mathcal {N}_\text {CP} ,\\ \mathcal {E}_\text {S} =&\{\mathcal {E}^{\text {SE}\rightarrow \text {SP}}_\texttt {subjectOf}, \mathcal {E}^{\text {SE}\rightarrow \text {SP}}_\texttt {objectOf},\\ {}&\;\, \mathcal {E}^{\text {SP}\rightarrow \text {SE}}_\texttt {hasSubject}, \mathcal {E}^{\text {SP}\rightarrow \text {SE}}_\texttt {hasObject}\},\end{aligned} \end{aligned}$$
(3)

where \([0,1]^4\) is the set of possible bounding boxes, and \(\mathcal {N}_\text {SE} \times \mathcal {N}_\text {SE} \times \mathcal {N}_\text {CP}\) is the set of all possible triples that consist of two scene entity nodes and a scene predicate node. Figure 1 shows an example scene graph and commonsense graph side by side, to make their similarities clearer. Here we assume every scene graph node has a label that exists in the commonsense graph; since in reality some objects and predicates might belong to background classes, we include a special commonsense node for the background entity class and another for the background predicate class.
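For illustration, the following minimal sketch (plain Python; the class names, relation names, and the container itself are ours, not part of the formulation) instantiates a tiny commonsense graph and a tiny scene graph using the same typed-node, typed-edge representation defined above.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    """A knowledge graph: typed nodes plus directed, weighted, typed edges."""
    entity_nodes: list = field(default_factory=list)      # N_E
    predicate_nodes: list = field(default_factory=list)   # N_P
    edges: dict = field(default_factory=dict)              # relation -> [(src, dst, weight)]

    def add_edge(self, relation, src, dst, weight=1.0):
        self.edges.setdefault(relation, []).append((src, dst, weight))

# Commonsense graph: one node per class; edges encode generic facts.
commonsense = KnowledgeGraph(
    entity_nodes=["person", "hand", "bike"],       # CE nodes
    predicate_nodes=["riding", "on", "holding"],   # CP nodes
)
commonsense.add_edge("partOf", "hand", "person")   # CE -> CE fact

# Scene graph: one node per instance; SE nodes carry boxes, SP nodes carry (subject, object).
scene = KnowledgeGraph(
    entity_nodes=[("se0", [0.1, 0.2, 0.5, 0.9]), ("se1", [0.4, 0.5, 0.8, 1.0])],
    predicate_nodes=[("sp0", "se0", "se1")],
)
scene.add_edge("hasSubject", "sp0", "se0")
scene.add_edge("hasObject", "sp0", "se1")
scene.add_edge("subjectOf", "se0", "sp0")
scene.add_edge("objectOf", "se1", "sp0")
```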

3.2 Bridging Knowledge Graphs

Considering the similarity between the commonsense and scene graph formulations, we make a subtle refinement that allows us to bridge these two graphs. Specifically, we remove the class from SE and SP nodes and instead encode it into a set of bridge edges \(\mathcal {E}_\text {B}\) that connect each SE or SP node to its corresponding class, i.e., a CE or CP node respectively:

$$\begin{aligned} \begin{aligned} \mathcal {N}_\text {SE}^\mathbf ? \subseteq&\, [0,1]^4 ,\\ \mathcal {N}_\text {SP}^\mathbf ? \subseteq&\, \mathcal {N}_\text {SE} \times \mathcal {N}_\text {SE} ,\\ \mathcal {E}_\text {B} =&\, \{\mathcal {E}^{\text {SE}\rightarrow \text {CE}}_\texttt {classifiedTo}, \mathcal {E}^{\text {SP}\rightarrow \text {CP}}_\texttt {classifiedTo},\\ {}&\;\, \mathcal {E}^{\text {CE}\rightarrow \text {SE}}_\texttt {hasInstance}, \mathcal {E}^{\text {CP}\rightarrow \text {SP}}_\texttt {hasInstance}\} ,\end{aligned} \end{aligned}$$
(4)

where \(.^\mathbf ? \) means the nodes are implicit, i.e., their classes are unknown. Each edge of type classifiedTo connects an entity or predicate to its corresponding label in the commonsense graph, and has a reverse edge of type hasInstance which connects the commonsense node back to the instance. Based on this reformulation, we can define the problem of SGG as extracting implicit entity and predicate nodes from the image (hereafter called the scene graph proposal), and then classifying them by connecting each entity or predicate to the corresponding node in the commonsense graph. Accordingly, given an input image I and a provided, fixed commonsense graph, the goal of SGG with commonsense knowledge is to maximize

$$\begin{aligned} \begin{aligned}&p(\mathcal {N}_\text {SE}, \mathcal {N}_\text {SP}, \mathcal {E}_\text {S} | I, \mathcal {N}_\text {CE}, \mathcal {N}_\text {CP}, \mathcal {E}_\text {C}) \\&\qquad =\, p(\mathcal {N}_\text {SE}^\mathbf ? , \mathcal {N}_\text {SP}^\mathbf ? , \mathcal {E}_\text {S} | I) \times \\&\qquad p(\mathcal {E}_\text {B} | I, \mathcal {N}_\text {CE}, \mathcal {N}_\text {CP}, \mathcal {E}_\text {C}, \mathcal {N}_\text {SE}^\mathbf ? , \mathcal {N}_\text {SP}^\mathbf ? , \mathcal {E}_\text {S}) .\end{aligned} \end{aligned}$$
(5)
Fig. 2. An illustrative example of the GB-Net process. First, we initialize the scene graph and entity bridges using a Faster R-CNN. Then we propagate messages to update node representations, and use them to update the entity and predicate bridges. This is repeated T times and the final bridge determines the output label of each node.

In this paper, the first term is implemented by a region proposal network that infers \(\mathcal {N}_\text {SE}^\mathbf ? \) given the image, followed by a simple predicate proposal algorithm that considers all possible entity pairs as \(\mathcal {N}_\text {SP}^\mathbf ? \). The second term is fulfilled by the proposed GB-Net, which infers bridge edges by incorporating the rich structure of the scene and commonsense graphs. Note that unlike most existing methods [2, 52], we do not factorize this into predicting entity classes given the image, and then predicate classes given entities. Therefore, our formulation is more general and allows the proposed method to classify entities and predicates jointly.

4 Method

The proposed method is illustrated in Fig. 2. Given an image, our model first applies a Faster R-CNN  [36] to detect objects, and represents them as scene entity (SE) nodes. It also creates a scene predicate (SP) node for each pair of entities, which forms a scene graph proposal, yet to be classified. Given this graph and a background commonsense graph, each with fixed internal connectivity, our goal is to create bridge edges between the two graphs that connect each instance (SE and SP node) to its corresponding class (CE and CP node). To this end, our model initializes entity bridges by connecting each SE to the CE that matches the label predicted by Faster R-CNN, and propagates messages among all nodes, through every edge type with dedicated message passing parameters. Given the updated node representations, it computes a pairwise similarity between every SP node and every CP node, and finds maximal similarity pairs to connect scene predicates to their corresponding classes, via predicate bridges. It also does the same for entity nodes to potentially refine their bridges too. Given the new bridges, it propagates messages again, and repeats this process for a predefined number of steps. The final state of the bridge determines which class each node belongs to, resulting in the output scene graph.
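At a high level, the process can be summarized by the pseudocode-style sketch below; every helper name (detect_objects, build_scene_graph, init_entity_bridges, propagate_messages, update_bridges, read_out_scene_graph) is a placeholder for the steps detailed in Sects. 4.1 and 4.2, not a function of our released code, and the helpers are not defined here.

```python
def gb_net(image, commonsense_graph, T=3):
    # 1. Scene graph proposal: detect entities, pair every ordered pair into a predicate node.
    entities, label_dists = detect_objects(image)              # Faster R-CNN (placeholder)
    predicates = [(s, o) for s in entities for o in entities if s is not o]
    scene_graph = build_scene_graph(entities, predicates)      # subjectOf / objectOf / ... edges

    # 2. Initialize entity bridges from detector labels; predicate bridges start empty.
    bridges = init_entity_bridges(label_dists, commonsense_graph)

    # 3. Alternate message passing and bridge refinement for T steps.
    for _ in range(T):
        node_states = propagate_messages(scene_graph, commonsense_graph, bridges)
        bridges = update_bridges(node_states)                  # pairwise similarity + top-K

    # 4. The final bridges determine the class of every entity and predicate node.
    return read_out_scene_graph(scene_graph, bridges)
```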

4.1 Graph Initialization

The object detector outputs a set of n detected objects, each with a bounding box \(b_j\), a label distribution \(p_j\) and an RoI-aligned [36] feature vector \(\mathbf {v}_j\). We then allocate a scene entity node (SE) for each object, and a scene predicate node (SP) for each pair of objects, representing the potential predicate with the two entities as its subject and object. Each entity is initialized using its RoI features \(\mathbf {v}_j\), and each predicate is initialized using the RoI features \(\mathbf {u}_j\) of a bounding box enclosing the union of its subject and object. Formally,

$$\begin{aligned} \begin{aligned} \mathbf {x}^{\text {SE}}_j = \phi ^{\text {SE}}_\text {init}(\mathbf {v}_j) \;, \quad \text {and} \quad \mathbf {x}^{\text {SP}}_j = \phi ^{\text {SP}}_\text {init}(\mathbf {u}_j),\end{aligned} \end{aligned}$$
(6)

where \(\phi ^\text {SE}_\text {init}\) and \(\phi ^\text {SP}_\text {init}\) are two fully connected networks that are branched from the backbone after ROI-align. To form a scene graph proposal, we connect each predicate node to its subject and object via labeled edges. Specifically, we define the following 4 edge types: for a triplet \(s-p-o\), we connect p to s using a hasSubject edge, p to o using a hasObject edge, s to p using a subjectOf edge, and o to p using an objectOf edge. The reason we have two directions as separate types is that in the message passing phase, the way we use predicate information to update entities should be different from the way we use entities to update predicates.
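As a toy illustration of this step, the sketch below (PyTorch, with random stand-ins for the RoI features and small MLPs in place of \(\phi _\text {init}\); the real heads are the three-layer networks of Sect. 5.2) builds the initial SE and SP states and wires the four scene edge types for every ordered entity pair.

```python
import torch
import torch.nn as nn

n, d_roi, d_node = 4, 4096, 1024   # toy sizes; d_roi stands in for the RoI-align feature dim

# Simplified init heads in the spirit of Eq. (6).
phi_init_se = nn.Sequential(nn.Linear(d_roi, d_node), nn.ReLU(), nn.Linear(d_node, d_node))
phi_init_sp = nn.Sequential(nn.Linear(d_roi, d_node), nn.ReLU(), nn.Linear(d_node, d_node))

v = torch.randn(n, d_roi)            # stand-in for per-entity RoI features v_j
u = torch.randn(n * (n - 1), d_roi)  # stand-in for union-box RoI features u_j of ordered pairs
x_se, x_sp = phi_init_se(v), phi_init_sp(u)   # initial SE and SP node states

# One SP node per ordered entity pair, wired with the four scene edge types.
pairs = [(s, o) for s in range(n) for o in range(n) if s != o]
scene_edges = {"hasSubject": [], "hasObject": [], "subjectOf": [], "objectOf": []}
for p, (s, o) in enumerate(pairs):
    scene_edges["hasSubject"].append((p, s))
    scene_edges["hasObject"].append((p, o))
    scene_edges["subjectOf"].append((s, p))
    scene_edges["objectOf"].append((o, p))
```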

On the other hand, we initialize the commonsense graph with commonsense entity nodes (CE) and commonsense predicate nodes (CP) using a linear projection of their word embeddings:

$$\begin{aligned} \begin{aligned}&\mathbf {x}^{\text {CE}}_i = \phi ^{\text {CE}}_\text {init}(\mathbf {e}^n_i) \;, \quad \text {and} \quad&\mathbf {x}^{\text {CP}}_i = \phi ^{\text {CP}}_\text {init}(\mathbf {e}^p_i).\end{aligned} \end{aligned}$$
(7)

The commonsense graph also has various types of edges, such as UsedFor and PartOf, as detailed in Sect. 5.2. Our method is independent of the types of commonsense edges, and can utilize any provided graph from any source.

So far, we have two isolated graphs, scene and commonsense. An SE node representing a detected Person intuitively refers to the Person concept in the ontology, and hence the Person node in the commonsense graph. Therefore, we connect each SE node to the CE node that corresponds to the semantic label predicted by Faster R-CNN, via a \(\texttt {classifiedTo}\) edge. Instead of a hard classification, we connect each entity to the top \(K_\text {bridge}\) classes, using \(p_j\) (the class distribution predicted by Faster R-CNN) as weights. We also create a reverse connection from each CE node to the corresponding SE nodes, using a \(\texttt {hasInstance}\) edge with the same weights \(p_j\). As mentioned earlier, this is to make sure information flows from commonsense to scene as well as from scene to commonsense, but not in the same way. We similarly define two other edge types, \(\texttt {classifiedTo}\) and \(\texttt {hasInstance}\), for predicates; these are initially empty, and will be updated to bridge SP nodes to CP nodes as we explain in the following. These 4 edge types can be seen as flexible bridges that connect the two fixed graphs, and are treated as latent variables to be determined by the model.
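A minimal sketch of this bridge initialization, assuming the detector's label distributions are stacked into a matrix over the commonsense entity classes (the function name and toy sizes are ours):

```python
import torch

def init_entity_bridges(p, k_bridge=5):
    """p: (n_se, n_ce) detector label distributions.
    Keeps the top-k classes per entity and returns the weights of both bridge directions."""
    topv, topi = p.topk(k_bridge, dim=1)
    a_eb = torch.zeros_like(p).scatter_(1, topi, topv)  # sparse SE -> CE weights
    classified_to = a_eb                                 # SE_i --classifiedTo--> CE_j
    has_instance = a_eb.t()                              # CE_j --hasInstance--> SE_i, same weights
    return classified_to, has_instance

# Toy usage: 3 detected entities, 151 commonsense entity classes (150 + background).
p = torch.softmax(torch.randn(3, 151), dim=1)
cls_to, has_inst = init_entity_bridges(p)
```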

This forms a heterogeneous graph with four types of nodes (SE, SP, CE, and CP) and various types of edges: scene graph edges \(\mathcal {E}_\text {S}\) such as subjectOf, commonsense edges \(\mathcal {E}_\text {C}\) such as usedFor, and bridge edges \(\mathcal {E}_\text {B}\) such as classifiedTo. Next, we explain how our proposed method updates node representations and bridge edges, while keeping commonsense and scene edges fixed.

4.2 Successive Message Passing and Bridging

Given a heterogeneous graph as described above, we employ a variant of GGNN  [25] to propagate information among nodes. First, each node representation is fed into a fully connected network to compute outgoing messages, that is

$$\begin{aligned} \begin{aligned} \mathbf {m}^{\varDelta \rightarrow }_i = \phi ^{\varDelta }_\text {send}(\mathbf {x}^{\varDelta }_i),\end{aligned} \end{aligned}$$
(8)

for each i and node type \(\varDelta \), where \(\phi _\text {send}\) is a trainable send head whose weights are shared across nodes of each type. After computing outgoing messages, we send them through all outgoing edges, multiplied by the edge weight. Then for each node, we aggregate the incoming messages, by first summing across edges of the same type, and then concatenating across edge types. We compute the incoming message for each node by applying another fully connected network on the aggregated messages:

$$\begin{aligned} \begin{aligned} \mathbf {m}^{\varDelta \leftarrow }_j = \phi ^{\varDelta }_\text {receive}\Bigg (\bigcup _{\varDelta '}\bigcup _{\mathcal {E}_k \in \mathcal {E}^{\varDelta ' \rightarrow \varDelta }}\sum _{(i,j, a^k_{ij}) \in \mathcal {E}_k} a^k_{ij} \mathbf {m}^{\varDelta '\rightarrow }_i\Bigg ),\end{aligned} \end{aligned}$$
(9)

where \(\phi _\text {receive}\) is a trainable receive head and \(\cup \) denotes concatenation. Note that the first concatenation is over all 4 node types, the second concatenation is over all edge types from \(\varDelta '\) to \(\varDelta \), and the sum is over all edges of that type, where i and j are the head and tail nodes, and \(a^k_{ij}\) is the edge weight. Given the incoming message for each node, we update the representation of the node using a Gated Recurrent Unit (GRU) update rule, following [4]:

$$\begin{aligned} \begin{aligned} \mathbf {z}_j^\varDelta&= \sigma \big (W_z^\varDelta \mathbf {m}^{\varDelta \leftarrow }_j + U_z^\varDelta \mathbf {x}_j^\varDelta \big ),\\ \mathbf {r}_j^\varDelta&= \sigma \big (W_r^\varDelta \mathbf {m}^{\varDelta \leftarrow }_j + U_r^\varDelta \mathbf {x}_j^\varDelta \big ),\\ \mathbf {h}_j^\varDelta&= \tanh \big (W_h^\varDelta \mathbf {m}^{\varDelta \leftarrow }_j + U_h^\varDelta (\mathbf {r}_j^\varDelta \odot \mathbf {x}_j^\varDelta )\big ),\\ \mathbf {x}_j^\varDelta&\Leftarrow (1-\mathbf {z}_j^\varDelta )\odot \mathbf {x}_j^\varDelta + \mathbf {z}_j^\varDelta \odot \mathbf {h}_j^\varDelta ,\end{aligned} \end{aligned}$$
(10)

where \(\sigma \) is the sigmoid function, and \(W_.^\varDelta \) and \(U_.^\varDelta \) are trainable matrices that are shared across nodes of the same type, but distinct for each node type \(\varDelta \). This update rule can be seen as an extension of GGNN [25] to heterogeneous graphs, with a more complex message aggregation strategy. Note that \(\Leftarrow \) means we update the node representation. Mathematically, this means \(\mathbf {x}_{j(t+1)}^\varDelta = U(\mathbf {x}_{j(t)}^\varDelta )\), where U is the aforementioned update rule and (t) denotes iteration number. For simplicity, we drop this subscript throughout this paper.
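The sketch below illustrates one such propagation step over the heterogeneous graph. It is a simplified reading of Eqs. (8)-(10): each edge type is stored as a dense weight matrix, every node type is assumed to receive at least one edge type, and the concatenation order of edge types is assumed fixed so that the receive heads see a consistent input size. All names and shapes are ours.

```python
import torch
import torch.nn as nn

class NodeUpdater(nn.Module):
    """GRU-style update of Eq. (10) for one node type (weights shared within the type)."""
    def __init__(self, d_in, d):
        super().__init__()
        self.Wz, self.Uz = nn.Linear(d_in, d, bias=False), nn.Linear(d, d, bias=False)
        self.Wr, self.Ur = nn.Linear(d_in, d, bias=False), nn.Linear(d, d, bias=False)
        self.Wh, self.Uh = nn.Linear(d_in, d, bias=False), nn.Linear(d, d, bias=False)

    def forward(self, m, x):
        z = torch.sigmoid(self.Wz(m) + self.Uz(x))
        r = torch.sigmoid(self.Wr(m) + self.Ur(x))
        h = torch.tanh(self.Wh(m) + self.Uh(r * x))
        return (1 - z) * x + z * h

def propagate_once(x, adj, phi_send, phi_receive, updaters):
    """x[t]: (n_t, d) states of node type t in {SE, SP, CE, CP};
    adj[(src_type, relation, dst_type)]: (n_src, n_dst) edge weight matrix."""
    out = {t: phi_send[t](x[t]) for t in x}                 # outgoing messages, Eq. (8)
    incoming = {t: [] for t in x}
    for (src, rel, dst), a in adj.items():                  # weighted sum per edge type, Eq. (9)
        incoming[dst].append(a.t() @ out[src])
    new_x = {}
    for t in x:
        m = phi_receive[t](torch.cat(incoming[t], dim=1))   # concatenate across edge types
        new_x[t] = updaters[t](m, x[t])                     # GRU update, Eq. (10)
    return new_x
```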

So far, we have explained how to update node representations using graph edges. Now, using the new node representations, we update the bridge edges \(\mathcal {E}_\text {B}\) that connect scene nodes to commonsense nodes. To this end, we compute a pairwise similarity from each SE to all CE nodes, and from each SP to all CP nodes:

$$\begin{aligned} \begin{aligned} \mathbf {a}^\text {EB}_{ij} = \frac{\exp \langle \mathbf {x}_i^\text {SE}, \mathbf {x}_j^\text {CE} \rangle _\text {EB}}{\sum _{j'} \exp \langle \mathbf {x}_i^\text {SE}, \mathbf {x}_{j'}^\text {CE} \rangle _\text {EB}} \;, \quad \text {where} \quad \langle \mathbf {x}, \mathbf {y} \rangle _\text {EB} = \phi ^{\text {SE}}_\text {att}(\mathbf {x})^T \phi ^{\text {CE}}_\text {att}(\mathbf {y}),\end{aligned} \end{aligned}$$
(11)

and similarly for predicates,

$$\begin{aligned} \begin{aligned} \mathbf {a}^\text {PB}_{ij} = \frac{\exp \langle \mathbf {x}_i^\text {SP}, \mathbf {x}_j^\text {CP} \rangle _\text {PB}}{\sum _{j'} \exp \langle \mathbf {x}_i^\text {SP}, \mathbf {x}_{j'}^\text {CP} \rangle _\text {PB}} \;, \quad \text {where} \quad \langle \mathbf {x}, \mathbf {y} \rangle _\text {PB} = \phi ^{\text {SP}}_\text {att}(\mathbf {x})^T \phi ^{\text {CP}}_\text {att}(\mathbf {y}).\end{aligned} \end{aligned}$$
(12)

Here \(\phi ^{\varDelta }_\text {att}\) is a fully connected network that resembles an attention head in transformers. Note that since \(\phi ^{\varDelta }_\text {att}\) is not shared across node types, our similarity metric is asymmetric. We use each \(\mathbf {a}^{\text {EB}}_{ij}\) to set the edge weight of the classifiedTo edge from \(\mathbf {x}_i^\text {SE}\) to \(\mathbf {x}_j^\text {CE}\), as well as the hasInstance edge from \(\mathbf {x}_j^\text {CE}\) to \(\mathbf {x}_i^\text {SE}\). Similarly, we use each \(\mathbf {a}^{\text {PB}}_{ij}\) to set the weight of the edges between \(\mathbf {x}_i^\text {SP}\) and \(\mathbf {x}_j^\text {CP}\). In preliminary experiments we realized that such fully connected bridges hurt performance in large graphs. Hence, we only keep the top \(K_\text {bridge}\) values of \(\mathbf {a}^{\text {EB}}_{ij}\) for each i, and set the rest to zero. We do the same for predicates, keeping the top \(K_\text {bridge}\) values of \(\mathbf {a}^{\text {PB}}_{ij}\) for each i. Given the updated bridges, we propagate messages again to update node representations, and iterate for a fixed number of steps, T. The final values of \(\mathbf {a}^{\text {EB}}_{ij}\) and \(\mathbf {a}^{\text {PB}}_{ij}\) are the outputs of our model, which are used to classify each entity and predicate in the scene graph.
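A condensed sketch of this bridge update (the similarity of Eqs. (11)-(12) followed by the top-\(K_\text {bridge}\) sparsification), with hypothetical attention heads and toy sizes:

```python
import torch
import torch.nn as nn

def update_bridge(x_scene, x_common, att_scene, att_common, k_bridge=5):
    """x_scene: (n_s, d) SE (or SP) states; x_common: (n_c, d) CE (or CP) states.
    Returns sparse bridge weights from scene nodes to commonsense nodes."""
    logits = att_scene(x_scene) @ att_common(x_common).t()   # asymmetric similarity, (n_s, n_c)
    a = torch.softmax(logits, dim=1)                          # normalize over commonsense nodes
    topv, topi = a.topk(min(k_bridge, a.size(1)), dim=1)      # keep only the top-K per scene node
    a_sparse = torch.zeros_like(a).scatter_(1, topi, topv)
    # a_sparse[i, j] weights both the classifiedTo edge (scene -> commonsense)
    # and the reverse hasInstance edge (commonsense -> scene).
    return a_sparse

# Toy usage with 3 scene entities and 151 commonsense entity classes.
d = 1024
att_se = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
att_ce = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
a_eb = update_bridge(torch.randn(3, d), torch.randn(151, d), att_se, att_ce)
```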

4.3 Training

For the training procedure we closely follow [2], which itself follows [52]. Specifically, given the output and ground truth graphs, we align output entities and predicates to their ground truth counterparts. Entities are aligned using IoU, and predicates are then aligned naturally, since they correspond to aligned pairs of entities. We then use the output probability scores of each node to define a cross-entropy loss. The sum of all node-level loss values is the objective function, which is minimized using Adam [18].

Due to the highly imbalanced predicate statistics in Visual Genome, we observed that best-performing models usually concentrate their performance on the most frequent classes, such as on and wearing. To alleviate this, we modify the commonly used cross-entropy objective by assigning an importance weight to each class. We follow the recently proposed class-balanced loss [5], where the weight of each class is inversely proportional to its frequency. More specifically, we use the following loss function for each predicate node:

$$\begin{aligned} \begin{aligned} \mathcal {L}^P_i = - \frac{1 - \beta }{1 - \beta ^{n_j}} \log \mathbf {a}^{\text {PB}}_{ij},\end{aligned} \end{aligned}$$
(13)

where j is the class index of the ground truth predicate aligned with i, \(n_j\) is the frequency of class j in the training data, and \(\beta \) is a hyperparameter. Note that \(\beta =0\) leads to a regular cross-entropy loss, and the closer it gets to 1, the more strongly it suppresses frequent classes. For a fair comparison with other methods, we also include a variant of our method without reweighting, which still outperforms all other methods.
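A small sketch of this per-node loss, written directly on the predicate bridge weights; the helper name is ours, and the node-level losses are summed as described above:

```python
import torch

def class_balanced_predicate_loss(a_pb, targets, class_counts, beta=0.999):
    """a_pb: (n_sp, n_cp) predicate bridge weights (softmax scores), Eq. (12);
    targets: (n_sp,) ground-truth class indices; class_counts: (n_cp,) training frequencies."""
    weights = (1.0 - beta) / (1.0 - beta ** class_counts.float())       # per-class weight, Eq. (13)
    log_p = torch.log(a_pb.gather(1, targets.unsqueeze(1)).squeeze(1).clamp_min(1e-12))
    return -(weights[targets] * log_p).sum()                            # sum over predicate nodes

# Toy usage: 4 aligned predicate instances, 51 predicate classes.
a_pb = torch.softmax(torch.randn(4, 51), dim=1)
loss = class_balanced_predicate_loss(a_pb, torch.tensor([0, 3, 3, 20]),
                                     torch.randint(1, 1000, (51,)))
```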

5 Experiments

Following the literature, we use the large-scale Visual Genome benchmark [20] to evaluate our method. We first show our GB-Net outperforms the state of the art, by extensively evaluating it on 24 performance metrics. Then we present an ablation study to illustrate how each innovation contributes to the performance. In the Supplementary Material, we also provide a per-class performance breakdown to show the consistency and robustness of our performance across frequent and rare classes. That is accompanied by a computational speed analysis, and several qualitative examples of our generated graphs compared to the state of the art, side by side.

5.1 Task Description

Visual Genome [20] consists of 108,077 images with annotated objects (entities) and pairwise relationships (predicates), which were post-processed by [44] to create scene graphs. They use the most frequent 150 entity classes and 50 predicate classes to filter the annotations. Figure 1 shows an example of their post-processed scene graphs, which we use as ground truth. We closely follow their evaluation settings, such as the train and test splits.

The task of scene graph generation, as described in Sect. 4, is equivalent to the SGGen scenario proposed by [44] and followed ever since. Given an image, the task of SGGen is to jointly infer entities and predicates from scratch. Since this task is limited by the quality of the object proposals, [44] also introduced two other tasks that more clearly evaluate entity and predicate recognition. In SGCls, we take localization (here, the region proposal network) out of the picture, by providing the model with ground truth bounding boxes during test, simulating a perfect proposal model. In PredCls, we take object detection for granted, and provide the model with not only ground truth bounding boxes, but also their true entity classes. In each task, the main evaluation metric is the average per-image recall of the top K subject-predicate-object triplets. The confidence of a triplet, used for ranking, is computed by multiplying the classification confidences of all three elements. Given the ground truth scene graph, each predicate forms a triplet, which we match against the top K triplets in the output scene graph. A triplet is matched if all three elements are classified correctly, and the bounding boxes of the subject and object match the ground truth with an IoU of at least 0.5. Besides the choice of K, there are two other choices to be made: (1) whether or not to enforce the so-called Graph Constraint (GC), which limits the top K triplets to only one predicate for each ordered entity pair, and (2) whether to compute the recall for each predicate class separately and take the mean (mR), or compute a single recall over all triplets (R) [2]. We comprehensively report both mean and overall recall, both with and without GC, and conventionally use both 50 and 100 for K, resulting in 8 metrics for each task, 24 in total.
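For reference, a stripped-down sketch of this matching procedure is given below; it omits details of the official protocol of [44] (e.g. the graph constraint and duplicate handling), so it should be read as an illustration of the metric rather than the evaluation code we use.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda z: (z[2] - z[0]) * (z[3] - z[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def recall_at_k(pred, gt, k=50, iou_thr=0.5):
    """pred: list of dicts with 'score', 'subj_cls', 'pred_cls', 'obj_cls', 'subj_box', 'obj_box';
    gt: same keys without 'score'. Returns the per-image recall of the top-k predicted triplets."""
    top = sorted(pred, key=lambda t: -t["score"])[:k]
    matched, hit = set(), 0
    for g in gt:
        for i, p in enumerate(top):
            if i in matched:
                continue
            if (p["subj_cls"], p["pred_cls"], p["obj_cls"]) != (g["subj_cls"], g["pred_cls"], g["obj_cls"]):
                continue
            if iou(p["subj_box"], g["subj_box"]) >= iou_thr and iou(p["obj_box"], g["obj_box"]) >= iou_thr:
                matched.add(i)
                hit += 1
                break
    return hit / max(len(gt), 1)
```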

5.2 Implementation Details

We use three-layer fully connected networks with ReLU activation for all trainable networks \(\phi _\text {init}\), \(\phi _\text {send}\), \(\phi _\text {receive}\) and \(\phi _\text {att}\). We set the dimension of node representations to 1024, and perform 3 message passing steps, except in the ablation experiments where we try 1, 2 and 3. We tried various values for \(\beta \); generally, the higher it is, the more mean recall improves and the more overall recall falls. We found 0.999 to be a good trade-off, and chose \(K_\text {bridge}=5\) empirically. All hyperparameters are tuned using a validation set randomly selected from the training data. We borrow the Faster R-CNN trained by [52], which is shared among all our baselines; it has a VGG-16 backbone and predicts 128 proposals.

In our commonsense graph, the nodes are the 151 entity classes and 51 predicate classes fixed by [44], including background. We use the GloVe [33] embeddings of the category titles to initialize their node representations (via \(\phi _\text {init}\)), and keep GloVe fixed during training. We compile our commonsense edges from three sources: WordNet [30], ConceptNet [27], and Visual Genome. In summary, there are three groups of edge types in our commonsense graph: SimilarTo edges from the WordNet hierarchy; PartOf, RelatedTo, IsA, MannerOf, and UsedFor edges from ConceptNet; and, from the VG training data, conditional probabilities of subject given predicate, predicate given subject, subject given object, etc. We explain the details in the supplementary material. The process of compiling and pruning the knowledge graph is semi-automatic and takes a single person less than a day. We make it publicly available as a part of our code. We have also tried using each individual source (e.g. only ConceptNet) independently, which requires less effort and does not significantly impact the performance. There are also recent approaches to automate the construction of commonsense knowledge graphs [1, 13], which can be utilized to further reduce the manual labor.
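As an example of the third group of edges (VG conditional probabilities), such weights can be estimated from training triplet counts along the lines of the sketch below (our simplified version; the actual edge set and any pruning are described in the supplementary material):

```python
from collections import Counter

def conditional_prob_edges(triplets):
    """triplets: iterable of (subject_class, predicate_class, object_class) from training data.
    Returns weighted CE -> CP and CP -> CE edges, e.g. p(predicate | subject) and p(subject | predicate)."""
    triplets = list(triplets)
    sp = Counter((s, p) for s, p, _ in triplets)         # (subject, predicate) co-occurrence
    s_count = Counter(s for s, _, _ in triplets)
    p_count = Counter(p for _, p, _ in triplets)
    pred_given_subj = {(s, p): c / s_count[s] for (s, p), c in sp.items()}   # CE -> CP weights
    subj_given_pred = {(p, s): c / p_count[p] for (s, p), c in sp.items()}   # CP -> CE weights
    return pred_given_subj, subj_given_pred

# Toy usage with made-up triplets.
edges = conditional_prob_edges([("person", "riding", "bike"),
                                ("person", "on", "bench"),
                                ("dog", "on", "bench")])
```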

Table 1. Evaluation in terms of mean and overall triplet recall, at top 50 and top 100, with and without Graph Constraint (GC), for the three tasks of SGGen, SGCls and PredCls. Numbers are in percentage. All baseline numbers are borrowed from [2]. The top two methods for each metric are shown in bold and italic, respectively.

5.3 Main Results

Table 1 summarizes our results in comparison to the state of the art. IMP+ refers to the re-implementation of [44] by [52] using their new Faster R-CNN backbone. That method does not use any external knowledge and only uses message passing among the entities and predicates and then classifies each. Hence, it can be seen as a strong, but knowledge-free baseline. FREQ is a simple baseline proposed by [52], which predicts the most frequent predicate for any given pair of entity classes, solely based on statistics from the training data. FREQ surprisingly outperforms IMP+, confirming the efficacy of commonsense in SGG.

SMN [52] applies bi-directional LSTMs on top of the entity features, then classifies each entity and each pair. They bias their classifier logits using statistics from FREQ, which improves their overall recall significantly, at the expense of higher bias against less frequent classes, as revealed by [2]. More recently, KERN [2] encodes VG statistics into the edge weights of its graph, which are then incorporated by propagating messages. Since it encodes statistics more implicitly, KERN is less biased than SMN, which improves mR. Our method improves both R and mR significantly, and our class-balanced model, GB-Net-\(\beta \), further enhances mR (\(+2.7\)% on average) without hurting R by much (\(-0.2\)%).

We observed that state-of-the-art performance has saturated in the SGGen setting, especially for overall recall. This is partly because object detection is a bottleneck that limits overall performance. It is worth noting that mean recall is a more important metric than overall recall, since most SGG methods tend to score a high overall recall by investing in the few most frequent classes and ignoring the rest [2]. As shown in Table 1, our method achieves significant improvements in mean recall. We provide an in-depth performance analysis, comparing our recall per predicate class with that of the state of the art, as well as qualitative analysis, in the Supplementary Material.

There are other recent SGG methods that are not used for comparison here, because their evaluation settings are not identical to ours, and their code is not publicly available to the best of our knowledge  [10, 34]. For instance, [34] reports only 8 out of our 24 evaluation metrics, and although our method is superior in 6 metrics out of those 8, that is not sufficient to fairly compare the two methods.

Table 2. Ablation study on Visual Genome. All numbers are in percentage, and graph constraint is enforced.

5.4 Ablation Study

To further explain our performance improvement, Table 2 compares our full method with its weaker variants. Specifically, to investigate the effectiveness of commonsense knowledge, we remove the commonsense graph and instead classify each node in our graph using a 2-layer fully connected classifier after message passing. This negatively impacts performance on all metrics, showing that our method is able to exploit commonsense knowledge through the proposed bridging technique. Moreover, to highlight the importance of our proposed message passing and bridge refinement process, we repeat the experiments with fewer steps. We observe that performance drops significantly with fewer steps, demonstrating the effectiveness of iterative refinement, but saturates as we go beyond 3 steps.

6 Conclusion

We proposed a new method for Scene Graph Generation that incorporates external commonsense knowledge in a novel, graphical neural framework. We unified the formulation of scene graph and commonsense graph as two types of knowledge graph, which are fused into a single graph through a dynamic message passing and bridging algorithm. Our method iteratively propagates messages to update nodes, then compares nodes to update bridge edges, and repeats until the two graphs are carefully connected. Through extensive experiments, we showed our method outperforms the state of the art in various metrics.