
1 Introduction

Scene graph generation (SGG) aims to generate scene graphs of images to model objects and their relationships. In a scene graph, nodes represent detected objects and edges represent the relationships between object pairs. Scene graphs have been adopted in a wide range of high-level visual tasks, such as image captioning [1] and visual question answering [2]. Due to the wide application of scene graphs [3,4,5,6], SGG has recently become a hot research topic.

In scene graph generation, a scene graph is a collection of visual triplets of the form subject-predicate-object, such as woman-holding-food and man-eating-food, as shown in Fig. 1. When predicting relationships, one key is to explore and exploit the rich semantic and spatial information of pairwise objects. However, most current SGG methods only exploit the visual information, semantic information and absolute positional information [7] of single objects, which cannot explicitly and effectively model the relationships between pairwise objects. It is more effective to model the relative positional information of pairwise objects, since different relative positional features may indicate different relationships. As shown in Fig. 1 (left), the food is far away from the woman, so holding is predicted; in Fig. 1 (right), the food is near the man, so eating describes the relationship between the man and the food better than holding. Inspired by [8], we model the relative positional information between object pairs using relative positions, including relative distances, relative scales and relative orientations. Methodologically, most existing approaches model semantic and spatial information with CNN [9], RNN [10], or attention [11] frameworks. Despite the success of these methods, they usually adopt an iterative modeling strategy to represent the context of single objects, which may limit the capability of modeling contextualized representations.

Fig. 1. Examples of how different relative positional information represents different relationships

In this paper, we propose a relative position relationship learning network (RPRL-Net) to explore and exploit the relative positional information of pairwise objects for SGG. To overcome the suboptimality of modeling the absolute positional information of single objects and of the iterative context modeling mechanism, a relative positional self-attention (RPSA) module is proposed to encode the relative positional information into object and relationship contexts. Besides, to facilitate the fusion of semantic and relative positional information, a new technique is developed to encourage increased interaction between the query, key and relative position embedding in the RPSA. Finally, we propose positional triplets, i.e., the absolute positional features of the subject and object together with the relative positional feature between them, and fuse them with the relationship contexts to predict relationships. The main contributions of this paper lie in two aspects:

  • In this paper, we propose a relative position relationship learning network (RPRL-Net) to explicitly represent the relationships between objects at different positions for SGG. Besides, a relative positional self-attention (RPSA) module is developed to encode the relative positional information into object and relation contexts.

  • We perform extensive experiments on the Visual Genome (VG) dataset [12] and compare RPRL-Net with state-of-the-art scene graph generation methods. Experimental results verify the superior performance of RPRL-Net.

The rest of this paper is organized as follows. Section 2 presents the details of our methods. Section 3 presents the experiments, followed by conclusion in Section 4.

2 Approach

In this section, we introduce the architecture of the relative position relationship learning network (RPRL-Net) for SGG. First, the feature representations extracted from the input image with a pre-trained object detector are described. Then, we explain the details of the RPSA module. Finally, the object classifier and relationship predictor are detailed. An overview flowchart of RPRL-Net is shown in Fig. 2.

2.1 Initial Feature Module

We use Faster R-CNN [13] to detect objects in the input images. For an image I, the initial feature module generates four types of features.

Visual Features: Each detected object is represented as a 4096-d vector by extracting the fc7 feature after RoI Align and the fc6 layer. The visual features of the m detected objects form the matrix \(\text {V}\) \(\in \) \(\mathbb {R}^{m\times {4096}}\).

Linguistic Features: We use a pretrained 300-d word embedding model [14] to transform the discrete labels into continuous linguistic features, obtaining a linguistic feature matrix of \(\text {L} \in \mathbb {R}^{m\times {300}}\).

Fig. 2. Flowchart of RPRL-Net, which consists of three modules: an initial feature module (IFM), an object classifier and a relationship predictor. RPRL-Net first obtains visual features (V), linguistic features (L) and relative positional features (PE) from the IFM. Then, the fusion features (X) and PE are fed to stacked RPSA modules to obtain updated context features (C) and predict object labels. Afterwards, the updated fusion features (Y) and PE are fed to stacked RPSA modules to obtain context features (W). The relationship predictor finally predicts the relationships based on the updated context features (Z).

Absolute Positional Features: The absolute positional feature \(\text {AP}\) \(\in \) \(\mathbb {R}^{m\times {9}}\) includes the bounding box \((\frac{x_{1}}{w}, \frac{y_{1}}{h}, \frac{x_{2}}{w}, \frac{y_{2}}{h})\), the center (\(\frac{x_{1}+x_{2}}{2w}\),\(\frac{y_{1}+y_{2}}{2h}\)) and the sizes (\(\frac{x_{2}-x_{1}}{w}\),\(\frac{y_{2}-y_{1}}{h}\), \(\frac{(x_{2}-x_{1})(y_{2}-y_{1})}{wh}\)). Here, (\(x_{1}, y_{1}, x_{2}, y_{2}\)) are the bounding box coordinates of the object proposals B, and w and h are the image width and height.
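The 9-d absolute positional feature can be computed directly from the box coordinates. The following is a minimal PyTorch sketch (not the authors' code); the function name and tensor layout are illustrative.

```python
import torch

def absolute_position(boxes: torch.Tensor, w: float, h: float) -> torch.Tensor:
    """boxes: (m, 4) tensor of (x1, y1, x2, y2); returns the (m, 9) AP matrix."""
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    return torch.stack([
        x1 / w, y1 / h, x2 / w, y2 / h,              # normalized corners
        (x1 + x2) / (2 * w), (y1 + y2) / (2 * h),    # normalized center
        (x2 - x1) / w, (y2 - y1) / h,                # normalized width and height
        (x2 - x1) * (y2 - y1) / (w * h),             # normalized area
    ], dim=1)
```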

Relative Positional Features: For the m-th and n-th objects, the relative distances are calculated as:

$$\begin{aligned} \textbf{d}_{mn}=[\text {log}(\frac{|x_m-x_n|}{w_m}), \text {log}(\frac{|y_m-y_n|}{h_m})] \end{aligned}$$
(1)

the relative scales are defined as:

$$\begin{aligned} \textbf{s}_{mn}=[\text {log}(\frac{|w_n|}{w_m}), \text {log}(\frac{|h_n|}{h_m})] \end{aligned}$$
(2)

and the relative orientation is calculated as a cosine function:

$$\begin{aligned} \textbf{o}_{mn}=\frac{x_m-x_n}{\sqrt{(x_m-x_n)^2+(y_m-y_n)^2}} \end{aligned}$$
(3)

Finally, the relative positional features are represented as:

$$\begin{aligned} \textbf{pos}_{mn}=[\textbf{d}_{mn},\textbf{s}_{mn},\textbf{o}_{mn}] \end{aligned}$$
(4)

These 5-d relative positional features are embedded into a high-dimensional representation by the method in [15], which computes sine and cosine functions of different wavelengths:

$$\begin{aligned} \text {PE}_{(i,pos)}=(\sin (\text {pos}/1000^{2i/d_{model}})||\cos (\text {pos}/1000^{2i/d_{model}})) \end{aligned}$$
(5)

where pos is the relative position and i is the dimension index. That is, each dimension of the positional encoding corresponds to a sinusoid, and \(d_{model}\) is the dimension of the output feature. Finally, we obtain a feature tensor \(\text {PE} \in \mathbb {R}^{m\times {n}\times {64}}\).
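As an illustration of Eqs. (1)-(5), the sketch below computes the pairwise 5-d relative positional features from box centers and sizes and expands each scalar with the sinusoidal encoding. The exact split of the 64 output dimensions across the five scalars is not specified in the text, so the sketch uses 12 dimensions per scalar (60 in total) purely as an assumption; the helper names are ours.

```python
import torch

def relative_position(centers, sizes):
    """centers: (m, 2) box centers (x, y); sizes: (m, 2) box (w, h).
    Returns pos: (m, m, 5) with [d_x, d_y, s_w, s_h, o] per pair (Eqs. 1-4)."""
    eps = 1e-6                                         # avoids log(0) and divide-by-zero
    x, y = centers[:, 0], centers[:, 1]
    w, h = sizes[:, 0], sizes[:, 1]
    dx = x[:, None] - x[None, :]                       # x_m - x_n
    dy = y[:, None] - y[None, :]                       # y_m - y_n
    d = torch.stack([torch.log(dx.abs() / w[:, None] + eps),
                     torch.log(dy.abs() / h[:, None] + eps)], dim=-1)   # Eq. (1)
    s = torch.stack([torch.log(w[None, :] / w[:, None]),
                     torch.log(h[None, :] / h[:, None])], dim=-1)       # Eq. (2)
    o = dx / torch.sqrt(dx ** 2 + dy ** 2 + eps)                        # Eq. (3)
    return torch.cat([d, s, o.unsqueeze(-1)], dim=-1)                   # Eq. (4)

def sinusoidal_embed(pos, dim_per_scalar=12, wave=1000.0):
    """Expand each scalar with sin/cos of different wavelengths (Eq. 5)."""
    i = torch.arange(dim_per_scalar // 2, dtype=pos.dtype)
    freq = wave ** (2 * i / dim_per_scalar)                 # (dim/2,) wavelengths
    angles = pos.unsqueeze(-1) / freq                       # (m, m, 5, dim/2)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (m, m, 5, dim)
    return emb.flatten(-2)                                  # (m, m, 5 * dim)
```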

Fusion Features: \(\text {V}\) and \(\text {L}\) are concatenated and then linearly transformed to a matched dimensionality, resulting in the fused features \(\text {X} \in \mathbb {R}^{m\times {1024}}\). The process is calculated as:

$$\begin{aligned} \text {X}=\text {Linear}(\text {V}\Vert \text {L}) \end{aligned}$$
(6)
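A minimal sketch of Eq. (6), assuming V and L are the visual and linguistic feature matrices defined above; dummy inputs are used to keep the snippet self-contained.

```python
import torch
import torch.nn as nn

m = 8                                    # e.g. 8 detected objects
V = torch.randn(m, 4096)                 # visual features
L = torch.randn(m, 300)                  # linguistic features
fuse = nn.Linear(4096 + 300, 1024)       # linear transform after concatenation
X = fuse(torch.cat([V, L], dim=1))       # fused features X: (m, 1024)
```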

2.2 Relative Position Self-attention Module

Let \(\text {X}\) \(\in \) \(\mathbb {R}^{m\times d_o}\) denote the fusion feature set of objects, where \(d_o\) is the feature dimension of \(\text {X}\). \(\text {X}\) is first fed into three parallel linear layers to obtain the queries \(\text {Q}\), keys \(\text {K}\), and values \(\text {V}\), respectively. \(\text {Q}, \text {K}, \text {V}\) are defined as:

$$\begin{aligned} \text {Q}= \text {Linear(X)}, \text {K}=\text {Linear(X)}, \text {V}=\text {Linear(X)} \end{aligned}$$
(7)

where \(\text {Q}, \text {K}, \text {V} \in \mathbb {R}^{m\times {d_k}}\) and \({d_k}\) is the output feature dimension. The original self-attention module uses a scaled dot product to compute the similarity of the fusion features. Inspired by [8], we encode the relative positional features into the fusion features, and the self-attention mechanism is rewritten as:

$$\begin{aligned} \text {SA}(\text {Q},\text {K},\text {V},\text {RP})=\text {Softmax}(\frac{\text {Q}\text {K}^T}{\sqrt{d_{k}}} + \text {RP})\text {V} \end{aligned}$$
(8)

where \(\sqrt{d_{k}}\) is a scaling factor following [15]. \(\text {RP} \in \mathbb {R}^{m\times {m}}\) is the updated relative positional term added to the attention logits. \(\text {RP}\) is defined as:

$$\begin{aligned} \text {RP}=\text {FC}(\text {Q}+\text {K}+\text {FC}(\text {PE})) \end{aligned}$$
(9)

where \(\text {FC}(\text {PE})\) corresponds to a linear layer applied to the last axis of \(\text {PE}\). \(\text {RP}\) is computed from the sum of the query, key and relative position features, which increases the interaction among them. In this way, \(\text {RP}\) serves as a gate that modulates the dot product of the query and key: it prevents a query from attending heavily to a content-wise similar key when the query and key positions are far away from each other.
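For concreteness, the sketch below implements one plausible reading of Eqs. (7)-(9) in PyTorch: RP is obtained by projecting the sum of the broadcast queries, keys and embedded relative positions to one logit per object pair, so that it can be added to the (m, m) attention matrix. The class name, layer sizes and the 60-d PE dimension (chosen to match the embedding sketch above, whereas the text reports 64) are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativePositionSelfAttention(nn.Module):
    """Scaled dot-product attention with a relative-position term RP (Eqs. 7-9)."""
    def __init__(self, d_in=1024, d_k=64, d_pe=60):
        super().__init__()
        self.q = nn.Linear(d_in, d_k)
        self.k = nn.Linear(d_in, d_k)
        self.v = nn.Linear(d_in, d_k)
        self.pe_fc = nn.Linear(d_pe, d_k)   # FC(PE) in Eq. (9)
        self.rp_fc = nn.Linear(d_k, 1)      # projects the sum to one logit per pair
        self.d_k = d_k

    def forward(self, X, PE):
        """X: (m, d_in) fusion features; PE: (m, m, d_pe) relative position embedding."""
        Q, K, V = self.q(X), self.k(X), self.v(X)                     # Eq. (7)
        # Eq. (9): interaction of query, key and relative position embedding.
        inter = Q.unsqueeze(1) + K.unsqueeze(0) + self.pe_fc(PE)      # (m, m, d_k)
        RP = self.rp_fc(inter).squeeze(-1)                            # (m, m)
        logits = Q @ K.t() / self.d_k ** 0.5 + RP                     # Eq. (8)
        return F.softmax(logits, dim=-1) @ V
```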

The multi-head variant of the attention module, which allows the model to jointly attend to information from different representation subspaces, is widely used and is defined as

$$\begin{aligned} \text {Multi-Head}(\text {Q},\text {K},\text {V},\text {RP})=\text {Concat}(\text {head}_1,\cdots ,\text {head}_H)\text {W}^o \end{aligned}$$
(10)
$$\begin{aligned} \text {head}_k=\text {SA}(\text {Q},\text {K},\text {V},\text {RP}) \end{aligned}$$
(11)

Finally, we combine the attention module with an FFN layer, which contains two fully connected layers, to form the relative positional self-attention module:

$$\begin{aligned} \text {FFN}(X)=\text {FC}_{o1}(\sigma (\text {FC}_{o2}(X))) \end{aligned}$$
(12)

where \(\sigma \) indicates ReLU. A residual connection with layer normalization [15], defined as \(\text {X}\) = \(\text {X}+\text {LN}(\text {Fun}(\text {X}))\), is added to each attention network and each FFN. Here, \(\text {X}\) is the input feature set, \(\text {LN}(\cdot )\) indicates layer normalization, and \(\text {Fun}(\cdot )\) represents either an attention network or an FFN.
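Putting Eqs. (10)-(12) together, a single stacked RPSA layer might look as follows. It reuses the single-head sketch above; the residual form X = X + LN(Fun(X)) and the FFN structure follow the description in the text, while the head count and FFN width are assumptions.

```python
class RPSALayer(nn.Module):
    """One RPSA layer: multi-head attention (Eqs. 10-11) plus FFN (Eq. 12)."""
    def __init__(self, d_model=1024, n_heads=8, d_pe=60, d_ff=2048):
        super().__init__()
        d_k = d_model // n_heads
        self.heads = nn.ModuleList(
            [RelativePositionSelfAttention(d_model, d_k, d_pe) for _ in range(n_heads)])
        self.w_o = nn.Linear(d_model, d_model)                 # W^o in Eq. (10)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))     # Eq. (12)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, X, PE):
        attn = self.w_o(torch.cat([h(X, PE) for h in self.heads], dim=-1))  # Eq. (10)
        X = X + self.ln1(attn)           # residual form X = X + LN(Fun(X))
        X = X + self.ln2(self.ffn(X))
        return X
```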

2.3 Object Classifier

In object classification, to take into account the relative positional information and the interaction among the query, key and relative position embedding, the fusion features and relative positional features are fed into the stacked RPSA modules to obtain the object context features. Then, the object context features are projected into c-dimensional vectors \(O \in \mathbb {R}^{m\times {c}}\), where c is the number of object classes. Finally, we predict the refined object labels by applying a softmax cross-entropy loss to the c-dimensional vectors.
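A minimal sketch of this classification step, reusing the RPSALayer sketch above with dummy inputs; the layer count, class count and variable names are illustrative placeholders.

```python
m, num_classes = 8, 150                                # e.g. 150 object classes
X, PE = torch.randn(m, 1024), torch.randn(m, m, 60)    # fusion features and PE
gt_labels = torch.randint(0, num_classes, (m,))        # ground-truth object labels

rpsa_layers = nn.ModuleList([RPSALayer() for _ in range(2)])   # stacked RPSA modules
C = X
for layer in rpsa_layers:
    C = layer(C, PE)                                   # object context features
logits = nn.Linear(1024, num_classes)(C)               # project to c-dimensional vectors
loss_obj = F.cross_entropy(logits, gt_labels)          # softmax cross-entropy loss
```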

2.4 Relationship Predictor

Suppose an object proposal set \(\mathcal {B}\) = \(\{b\}\) is given. The updated fusion features \(\text {Y}\) of the object proposals \(\mathcal {B}\) are initialized by fusing the visual features and linguistic features obtained from the corresponding object proposals with the object context features obtained from the last RPSA layer. \(\text {Y}\) is calculated as:

$$\begin{aligned} \text {Y}=\sigma (\text {FC}(\text {V}\Vert \text {L}\Vert \text {C})) \end{aligned}$$
(13)

Then, we feed \(\text {Y}\) into the stacked RPSA modules to obtain the subject context features \(\text {SC}\) of subject proposals and the object context features \(\text {OC}\) of object proposals, respectively.

Afterwards, the edge context features \({\text {Z}}_{so}\) between the object pair \(v_{so}\) are calculated as:

$$\begin{aligned} \text {Z}_{so}=\sigma (\text {FC}_{v3}(\text {FC}_{v1}(\text {SC})\Vert \text {FC}_{v2}(\text {OC})\Vert \text {FC} (\text {PE}_{so}+\text {AP}_s+\text {AP}_o))) \end{aligned}$$
(14)

where \(\text {AP}_s\)-\(\text {PE}_{so}\)-\(\text {AP}_o\) denotes a positional triplet, which consists of the absolute positional features of the subject and object as well as the relative positional feature between them. Finally, we use a binary cross-entropy loss to predict the relationship labels.
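The sketch below illustrates Eqs. (13)-(14). It assumes that PE_so, AP_s and AP_o have already been projected to a common dimension so they can be summed, and that an extra linear head maps Z_so to relation logits for the binary cross-entropy loss; all layer sizes and names are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationshipPredictor(nn.Module):
    def __init__(self, d_ctx=1024, d_pos=64, d_hid=512, n_rel=51):
        super().__init__()
        self.fc_y = nn.Linear(4096 + 300 + d_ctx, d_ctx)   # Eq. (13): fuse V, L and C
        self.fc_s = nn.Linear(d_ctx, d_hid)                # FC_v1 on subject context SC
        self.fc_o = nn.Linear(d_ctx, d_hid)                # FC_v2 on object context OC
        self.fc_p = nn.Linear(d_pos, d_hid)                # FC over the positional triplet
        self.fc_z = nn.Linear(3 * d_hid, d_hid)            # FC_v3
        self.rel_head = nn.Linear(d_hid, n_rel)            # assumed head for relation logits

    def fuse(self, V, L, C):
        """Eq. (13): Y = sigma(FC(V || L || C))."""
        return torch.relu(self.fc_y(torch.cat([V, L, C], dim=-1)))

    def forward(self, SC, OC, PE_so, AP_s, AP_o, rel_targets):
        """Eq. (14) followed by an assumed binary cross-entropy relation loss."""
        pos = self.fc_p(PE_so + AP_s + AP_o)               # positional triplet term
        Z_so = torch.relu(self.fc_z(torch.cat(
            [self.fc_s(SC), self.fc_o(OC), pos], dim=-1)))
        logits = self.rel_head(Z_so)
        return F.binary_cross_entropy_with_logits(logits, rel_targets)
```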

3 Experiments

Table 1. Performance comparison on SGDet of the VG dataset. We report R@20, R@50, R@100 and their mean, with and without graph constraints. “–” indicates the result is unavailable. The best performance is highlighted in boldface.

In this section, we conduct experiments to verify the effectiveness of the RPRL-Net on the commonly used benchmark.

3.1 Experimental Settings

Dataset: We use the Visual Genome (VG) dataset [12] to conduct all experiments. The VG dataset contains 108,077 images with, on average, annotations of 38 objects and 22 relations per image. Following previous works [17, 18], the most frequent 150 object categories and 50 predicate categories are used for evaluation, and the dataset is split into 70K/5K/32K train/validation/test sets.

Evaluation Tasks and Metrics: Following [17], the scene graph detection (SGDet) task is adopted. SGDet generates scene graphs of images, predicting the labels of objects and relationships without extra label information. Recall@K (R@K) is calculated by averaging the recall of the top-K relationships over all images [21]. We use recall as the evaluation metric and report \(K=\{20,50,100\}\) in our experiments. Performance with and without graph constraints [18] is considered.
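As a simplified illustration, per-image R@K can be computed as below; real SGDet evaluation additionally requires matching the subject and object boxes to ground truth with an IoU threshold (typically 0.5), which is omitted here.

```python
def recall_at_k(pred_triplets, gt_triplets, k):
    """pred_triplets: list of (subject, predicate, object) sorted by confidence;
    gt_triplets: set of ground-truth triplets for the same image."""
    hits = sum(1 for t in pred_triplets[:k] if t in gt_triplets)
    return hits / max(len(gt_triplets), 1)

# recall_at_k(preds, gts, 50) gives R@50 for one image; the reported R@K is
# the mean over all test images.
```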

3.2 Implementation Details

To ensure a fair comparison with previous SGG methods, we use the codebase and pre-trained object detection model provided by [17]. The backbone is Faster R-CNN with ResNeXt-101-FPN [22]. The hyperparameters mostly follow [17]. The SGD optimizer with a momentum of 0.9 is adopted. A warm-up strategy [15] is used to increase the learning rate from 0 to 0.001 over the first 5,000 iterations. Then, the learning rate is decayed by 0.1 at 18,000 and 24,000 iterations. Training lasts for 30,000 iterations in total. The base learning rate is set to 0.001 and the batch size is set to 12. For each image, the top-80 object proposals and 256 relationship proposals are provided, and the background/foreground ratio for relationship detection is set to 3:1.
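The learning-rate schedule described above can be summarized by the following sketch (linear warm-up followed by step decay); the function is illustrative and not part of the released codebase.

```python
def learning_rate(it, base_lr=1e-3, warmup=5000, steps=(18000, 24000), gamma=0.1):
    """Learning rate at iteration `it`: linear warm-up, then 0.1 decay at each step."""
    if it < warmup:
        return base_lr * it / warmup
    return base_lr * gamma ** sum(it >= s for s in steps)

# learning_rate(2500) -> 0.0005, learning_rate(10000) -> 0.001,
# learning_rate(20000) -> 0.0001, learning_rate(25000) -> 1e-05
```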

3.3 Performance Comparison

Table 1 presents the results of RPRL-Net and six SGG methods on SGDet of the VG dataset, with and without graph constraints; the best performance is highlighted in boldface. All methods use the same detector and backbone to extract object features. Compared with the second-best method, RPRL-Net obtains gains of 0.9% and 1.6% on R@50, and of 1.0% and 1.6% on R@100, with and without graph constraints, respectively. RPRL-Net consistently outperforms existing state-of-the-art approaches in terms of R@20, R@50 and R@100 on SGDet. These improvements again demonstrate the effectiveness of RPRL-Net.

3.4 Ablation Studies

A number of experiments are conducted to explore the reasons behind RPRL-Net’s success. The results are shown in Table 2 and discussed below. We design four variants with different combinations: Baseline does not use the RPSA module in either the object classifier or the relationship predictor. B+SA uses the self-attention module without the relative positional feature and without the interaction of Q, K and the relative positional feature. B+SA+P uses the self-attention module with the relative positional feature but without the interaction of Q, K and the relative positional feature. B+O-RPSA denotes that the RPSA module is used in the object classifier but not in the relationship predictor.

Table 2. Ablation studies

From Table 2, B+SA outperforms Baseline and B+SA+P outperforms B+SA. These improvements validate that the relative positional information and the interaction of Q, K and the relative positional feature have a positive influence on SGG. RPRL-Net outperforming B+O-RPSA indicates that using the RPSA module in both the object classifier and the relationship predictor is better than using it only in the object classifier.

4 Conclusion

In this paper, we propose a relative position relationship learning network (RPRL-Net) for SGG to explicitly represent the relationships between objects at different positions, motivated by the suboptimality of absolute positional information. The core of RPRL-Net is the relative positional self-attention (RPSA) module, which encodes the relative positional information into the object and relation contexts. Moreover, the interaction of the context features as well as Q, K and the relative positional feature is proposed to facilitate the understanding of object and relation semantics. Comprehensive experiments are conducted on the VG dataset, and the results demonstrate the strong reasoning and integration abilities of RPRL-Net.