1 Introduction

Scene graph generation (SGG) is a crucial task that benefits image captioning [1, 2], visual question answering [3, 4], video understanding [5, 6], and detection [7, 8]. However, most generated scene graphs suffer from trivial predictions and are therefore far from ready for practical applications.

Fig. 1

An example of multi-semantic language assistance for relation prediction. The bottom-right corner (green bars) shows the probability-updating process of three candidate predicates. a) Experience Estimation: humans recall a rough predicate distribution based on the co-occurrence probability of predicates and object pairs. b) Pattern Attention: the internal relationship between the object pair and the visual information is used to locate predicate-relevant visual patterns. c) Context Inference: context is used to refine the meanings of subject and object, preventing prediction biases caused by considering object pairs in isolation

Therefore, recent research has focused on unbiased methods that improve the Recall of hard-to-distinguish predicates. Generally, unbiased methods can be divided into three types: data resampling (e.g., BLS [9], GCL [10], DCNet [11]), predicate-aware loss design (e.g., CogTree [12], PCL [13], and FGPL [14]), and logit manipulation (e.g., TDE [15], RTPB [16], FREQ [17]). However, a common drawback is that they rely on explicitly modeling predicate correlations from dataset statistics [14, 17] or biased predictions [14, 15], which makes them sensitive to changes in these prerequisites. For instance, [15] is not effective when trained on an unbiased model, which limits SGG performance. Compared with loss-based and statistics-based approaches, language representation learning is much more robust because it learns implicit patterns of predicates and avoids visual feature redundancy, which has not been emphasized by previous unbiased methods.

Language representation learning has, however, already been adopted by some baseline models. For example, [18] uses word embeddings to ground attention on visual features, [10] utilizes a Cross Attention (CA) mechanism for multi-modality learning, and [19] introduces a transformer-based architecture to bridge the gap between images and texts. However, most of these approaches are not plug-and-play and merely use a single representation regardless of the semantic context, and therefore fail to unleash the power of language.

In fact, words have multiple meanings that carry different priors under different semantics, which can guide scene graph generation. Here, we give a multi-semantic reasoning example in Fig. 1. Given this shopping picture, a human first constructs a predicate distribution from the correlation between predicates and "woman-ball" as well as their relative position. This knowledge comes from experience, and the process is vision-independent. Next, still focusing on this area, a human can connect the subject-object pair with a visual pattern such as "the woman's hand is close to the ball", which is a strong "holding"-relevant pattern. In contrast, the contours of the woman and the girl are irrelevant for judging the relation. Finally, according to the surrounding objects (e.g., balls, girls, lights), a human can infer the selling context and avoid predicting "play", because it does not fit this scene.

Motivated by these observations, we heuristically design three plug-and-play language modules that exploit different language priors behind object categories. Each module takes the detected object classes as input and generates semantic embeddings in a different semantic space, which are used for extracting priors from language-visual pattern correlations, language context, and pair-predicate correlation, respectively. Concretely, 1) the Language Attention Module projects subject-object word embeddings to a unified semantic matrix; channel attention then extracts an attention vector for the relational visual feature map, which learns the relevance between an object pair and specific predicate-relevant visual patterns. 2) The Language Context Module employs a transformer-based encoder to encode the global language context into each entity's semantic embedding from a sequence of entity labels. Compared with pretrained word embeddings, this module generates semantic representations that fit the context. These two modules are used to refine the relation and entity visual representations, respectively. 3) The Experience Estimation Module is supervised by the marginal probabilities of subject-predicate and object-predicate pairs to learn a class- and spatial-aware predicate distribution as a likelihood offset. It is worth mentioning that language processing is disentangled from visual features at the very beginning, so this framework is applicable to most SGG baseline models.

To the best of our knowledge, we are the first to utilize multi-semantic language representations within object labels to achieve unbiased scene graph generation. The main contributions can be summarized as follows:

  1. We propose LANDMARK to introduce language representation learning into unbiased scene graph generation, stressing the under-explored use of multiple semantics in object labels.

  2. We devise three modules that project object labels into distinct semantic spaces and then extract priors of language-vision interaction patterns, semantic context, and pair-predicate correlation, respectively.

  3. Experiments on the SGG benchmark show consistent improvements over baseline models and compatibility with other unbiased methods, which indicates the effectiveness of the multi-semantic language representations induced from object labels.

Fig. 2

LANDMARK architecture. The image first goes through a generic object detector (Faster R-CNN) to obtain predicted object labels and ROI features. The labels then serve as input to three modules (i.e., EEM, LAM, LCM) that compute distinctive semantics. Finally, the LAM and LCM outputs are used to refine the relation and entity representations, respectively, and the EEM output serves as a prediction offset

2 Related works

Scene Graph Generation: There are two mainstream approaches to scene graph generation: context modeling and graph convolutional networks (GCN) [20]. The first approach focuses on modeling global information with sequential architectures [16, 17, 21,22,23,24,25,26,27,28]. Chen et al. [16] use two stacks of Transformers to encode global information. Zellers et al. [17] use an LSTM to encode the global context that informs relation prediction. However, merely modeling the global context is not sufficient for scene graph tasks. The second approach [9, 29,30,31] propagates messages between node and edge features and focuses more on regional pair-wise information. [9] applies multi-stage graph message propagation between entity and relationship representations. In [30], Yang et al. prune graphs into sparse ones, then apply an attentional graph convolution network to modulate information flow. [29] utilizes a GCN to update state representations as energy values. Chen et al. [32] construct a graph between entities and all relationship representations and aggregate messages with a GRU. Yet, this approach suffers from insufficient global context encoding. Our method considers both pair-wise and global contexts for representation refinement.

Unbiased Scene Graph Generation: USGG has become a hot research area since existing SGG datasets are long-tailed. FREQ [17] applies a distribution-based prior bias to predictions. Chen et al. [16] utilize a resistance bias term in the relationship classifier during training to optimize the loss value. CogTree [12] proposes a loss based on an automatically built cognitive structure of relationships derived from biased SGG predictions. [33] designs two separate classifiers for head and tail predicates. Though the mentioned methods achieve a remarkable boost on specific baseline models in terms of mRecall@k [34], they are sensitive to the training data distribution and easily overfit to tail classes. More recently, [35] observes that directly using visual features results in biased relation predictions. Later, [36] avoids directly using visual features by utilizing the Hermitian inner product to embed visual features into a complex space. These observations inspire us to design a network that learns language-guided visual features and explores multiple semantics from words.

Multimodal Learning: Multimodal representation learning has been widely explored in zero-shot image retrieval [37], text-video retrieval [38, 39], and object detection [40]. In the SGG task, language and commonsense are treated as multimodal representations. [19] parses sentences in an image-text dataset to extract triplets as supervision. [41] incorporates commonsense by unifying the forms of the scene graph and the commonsense graph. [10] adopts cross-attention modules between vision and text embeddings. However, existing methods treat the language modality as a single representation, which loses much information from other perspectives.

3 Methodology

Problem formulation: Given an image I, scene graph generation aims to predict the entity class set \(\mathcal {C}_{e}\), coordinate set \(\mathcal {B}_{e}\), and relation set \(\mathcal {C}_{r}\). Generally, existing SGG models receive visual entity and predicate (or node and edge) representations from the backbone. Then a graph \(\mathcal {G} = \{\mathcal {C}_{e},\mathcal {B}_{e}, \mathcal {C}_{r}\}\) can be formulated as

$$\begin{aligned} \mathcal {G}=P(\mathcal {C}_{r},\mathcal {B}_{e},\mathcal {C}_{e}\vert I)=SGG({N},{E}), \end{aligned}$$
(1)

where \({N}= \{ e_{i}\}_{i=1}^{n} \) and \({E}=\{ e_{ij}\}\) are the sets of entity and predicate representations. In this paper, we aim to update N and E by incorporating semantic priors.

Framework overview: LANDMARK consists of three semantic learning modules, i.e., Language Attention Module (LAM), Language Context Module (LCM), and Experience Estimation Module (EEM). The framework architecture is shown in Fig. 2. First, we obtain N and E from ROI pooling, and \(\mathcal {C}_{e}\) and \(\mathcal {B}_{e}\) from the classifier head. Then, the semantic extraction operation converts labels \(\{ c_{i}\} \) to semantic embeddings for each module. For LAM, the semantic embeddings of subject \(c_{i}\) and object \(c_{j}\) are multiplied (with one transposed) into a semantic matrix; channel attention then transfers the matrix into an attention vector and updates the relation representation. LCM encodes the semantic embeddings of all entity labels \(\mathcal {C}_{e}\) present in the image, generating a context-aware semantic entity feature that is concatenated with the visual entity representation. The updated entity and relation representations are passed through the baseline model. For EEM, a distribution label is generated to supervise the experience estimator, which combines subject-object semantic embeddings with a position embedding to yield distribution logits. Finally, the generated logits are used to update the final predicate likelihood.
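To make the data flow concrete, the following is a minimal sketch (not the authors' code) of how the three modules could wrap an off-the-shelf SGG baseline; the detector, module, and baseline interfaces are hypothetical assumptions for illustration.

```python
# Hypothetical glue code showing LANDMARK's data flow; all module interfaces are assumed.
import torch

def landmark_forward(detector, lcm, lam, eem, sgg_baseline, image):
    # 1) Generic object detector: ROI features, relation (union) features, labels, boxes.
    ent_feats, rel_feats, labels, boxes = detector(image)
    # 2) LCM refines entity representations with context-aware label semantics (concatenation).
    ent_refined = torch.cat([ent_feats, lcm(labels, boxes)], dim=-1)
    # 3) LAM refines relation representations via label-conditioned channel attention.
    rel_refined = lam(labels, rel_feats)
    # 4) The baseline SGG model predicts predicate logits from the refined features.
    logits = sgg_baseline(ent_refined, rel_refined)
    # 5) EEM adds a predicate-distribution offset to the final likelihood (Eq. 14).
    return logits + eem(labels, boxes)
```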

3.1 Semantic extraction

Semantic extraction is used to transfer labels to the corresponding semantic space, which is applied to three modules independently. Specifically, the semantic extractor consists of three operations:

$$\begin{aligned} \left\{ \begin{array}{rl} f_{se}^{sub}(c_{i}) &{}=w^{T}_{s}c_{i}, \\ f_{se}^{obj}(c_{j}) &{}=w^{T}_{o}c_{j}, \\ f_{se}^{ent}(c_{e}) &{}=w^{T}_{e}c_{e}, \end{array} \right. \end{aligned}$$
(2)

where the first and second operations are used in LAM and EEM for subject and object projection. Considering that the same word may carry contrasting meanings as subject or object (e.g., "eating" is a possible predicate when "man" is the subject, but impossible when "man" is the object), we use different weights \(w_{s}\) and \(w_{o}\) to project subject and object to semantic embeddings. The last operation is used in LCM; since all labels there are treated uniformly as entities, we use a single weight \(w_{e}\) for the semantic embedding.

In fact, \(f_{se}\) could be any projection function as long as its input is object labels. Here, we only use a naive 1-layer linear function to demonstrate the extraction's effectiveness.
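For concreteness, below is a minimal sketch of this extractor (Eq. 2), assuming class indices as input and optional GloVe initialization; the `glove_matrix` tensor is a hypothetical placeholder of shape [num_classes, dim].

```python
# A minimal sketch of the semantic extractor in Eq. (2); not the authors' implementation.
import torch.nn as nn

class SemanticExtractor(nn.Module):
    def __init__(self, num_classes: int, dim: int = 200, glove_matrix=None):
        super().__init__()
        # Separate weights for subject, object, and generic entity roles (w_s, w_o, w_e).
        self.w_s = nn.Embedding(num_classes, dim)
        self.w_o = nn.Embedding(num_classes, dim)
        self.w_e = nn.Embedding(num_classes, dim)
        if glove_matrix is not None:  # optional GloVe initialization, as in the paper's setup
            for emb in (self.w_s, self.w_o, self.w_e):
                emb.weight.data.copy_(glove_matrix)

    def subject(self, c):  # f_se^sub(c_i)
        return self.w_s(c)

    def object(self, c):   # f_se^obj(c_j)
        return self.w_o(c)

    def entity(self, c):   # f_se^ent(c_e)
        return self.w_e(c)
```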

3.2 Language Attention Module

This module aims to learn the prior between an object pair and predicate-relevant visual patterns within the visual relation representation \(e_{ij}\). The original feature extraction network (e.g., ResNet [42]) keeps both spatial and semantic information; specifically, different channels focus on different visual patterns. However, relation visual features inevitably mix in a large amount of irrelevant background information, so channel selection is needed. Heuristically, given a specific subject-object pair (e.g., boy-basketball or boy-street), the visual feature should have different activations. Therefore, we design a label-aware channel attention mechanism. Specifically, given subject i and object j, we first generate a semantic matrix \(x_{ij}\) as a unique representation of the word-vision correlation:

$$\begin{aligned} x_{ij}=f_{se}^{sub}(c_{i}) \otimes f_{se}^{obj}(c_{j})^T, \end{aligned}$$
(3)

where \(\otimes \) refers to matrix multiplication. We achieve channel attention by a series of 2D convolutions with the spatial pooling on \(x_{ij}\) to get attention vector \(e_{ij}^{c}\):

$$\begin{aligned} e_{ij}^{c} = \sigma (G_{\text {pooling}}(G^{n_{c}}_{\text {conv}}...\sigma (G^{1}_{\text {conv}}(x_{ij}))))\in \mathcal {R}^{C,1}, \end{aligned}$$
(4)

where C is the channel number of the visual relation feature \(e_{ij}\), \(n_{c}\) is the number of 2D convolution layers, \(G_{\text {pooling}}\) is the pooling operation, and \(\sigma \) is the activation function. Finally, the channel weights \(e_{ij}^{c}\) are used to update \(e_{ij}\), so that channels irrelevant to relation discrimination are suppressed:

$$\begin{aligned} \hat{e}_{ij}= e_{ij} \times e_{ij}^{c}, \end{aligned}$$
(5)

where \(\hat{e}_{ij}\) is the refined relation representation and \(\times \) denotes channel-wise multiplication.
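A minimal PyTorch sketch of Eqs. (3)-(5) follows; the number of convolution layers, kernel sizes, and intermediate widths are illustrative assumptions rather than the paper's exact configuration.

```python
# A minimal sketch of the Language Attention Module (Eqs. 3-5); layer sizes are assumptions.
import torch
import torch.nn as nn

class LanguageAttention(nn.Module):
    def __init__(self, embed_dim: int = 200, rel_channels: int = 256):
        super().__init__()
        self.convs = nn.Sequential(                       # G_conv^1 ... G_conv^{n_c}
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, rel_channels, kernel_size=3, padding=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)               # G_pooling
        self.act = nn.Sigmoid()                           # sigma

    def forward(self, sub_emb, obj_emb, rel_feat):
        # Eq. (3): semantic matrix x_ij = f_se^sub(c_i) (outer product) f_se^obj(c_j)^T
        x = torch.bmm(sub_emb.unsqueeze(2), obj_emb.unsqueeze(1))  # [B, d, d]
        x = x.unsqueeze(1)                                         # [B, 1, d, d]
        # Eq. (4): channel attention vector e_ij^c in R^{C}
        attn = self.act(self.pool(self.convs(x))).flatten(1)       # [B, C]
        # Eq. (5): channel-wise re-weighting of the relation feature
        return rel_feat * attn.view(attn.size(0), -1, 1, 1)        # [B, C, H, W]

# Usage sketch: lam = LanguageAttention(); refined = lam(sub_emb, obj_emb, rel_feat)
```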

Fig. 3

Context Encoder block structure (left). Dot-Product Attention (middle). Multi-Head Self Attention (right)

3.3 Language Context Module

Compared with visual information, a single word is semantically isolated from the other components in a sentence. Though we devise LAM and EEM for semantic extraction, the pairwise labels they utilize are confined to local semantics, which is insufficient for comprehensive semantic inference. Hence, LCM aims to address this global context deficiency. This module includes a semantic extractor and a context encoder. The context encoder consists of a multi-layer transformer encoder with Multi-Head Self-Attention (MHSA) [26] and a Feed Forward Network (FFN) [26]. The structure is illustrated in Fig. 3. Concretely, given an image with n entities, the input sequence X can be described as follows:

$$\begin{aligned} X = \left\{ s_{0},s_{1},...,s_{n}\right\} , \end{aligned}$$
(6)

where

$$\begin{aligned} s_{i}= \gamma (f_{se}^{ent}(c_{i}) + p_{i}) \in \mathbb {R}^{d}. \end{aligned}$$
(7)

Here, \(c_i\) and \(p_{i}=\phi [x_{i},y_{i},w_{i},h_{i}]\) refer to the class label and position embedding of the i-th entity, where \(x_{i}, y_{i}, w_{i}, h_{i}\) are the center coordinates, width, and height of object i, \(\gamma \) denotes a learnable linear transformation, and d is the dimension of each element in the sequence. We first reiterate the standard Scaled Dot-Product Attention [26] as below:

$$\begin{aligned} \text {Attention}(Q, K, V)=\text {softmax}\left( \frac{Q K^T}{\sqrt{d_k}}\right) V, \end{aligned}$$
(8)

then, Multi-Head Self Attention is formulated as

$$\begin{aligned} \text {MHSA}(X)&= \text{ Concat } \left( \text {head}_1, \ldots , \text {head}_{\textrm{h}}\right) W^O, \nonumber \\ \text {head}_{\textrm{i}}&=\text {Attention}\ \left( X W_i^Q, X W_i^K, X W_i^V\right) , \end{aligned}$$
(9)

where \(W_i^Q , W_i^K, W_i^V\) and \(W^O \) are parameter matrices. The b-th layer output \(X_{b}\) can be denoted as

$$\begin{aligned} X_{b}' = \text {MHSA}(\text {LN}(X_{b-1}))&+ X_{b-1},\end{aligned}$$
(10)
$$\begin{aligned} X_{b} = \text {FFN}(\text {LN}(X_b'))&+ X_{b}', \end{aligned}$$
(11)

where \(X_{b-1}\) is the \((b-1)\)-th layer output, we set \(X_0 = X\), and LN is layer normalization. Differing from previous works [16], LCM takes a sequence of entity labels as input, so the module can learn semantic entity representations on top of high-level information.
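The following sketch instantiates Eqs. (6)-(11) with PyTorch's built-in pre-norm transformer encoder (matching Eqs. 10-11); the position-embedding layer `phi` and the feed-forward width are assumptions.

```python
# A minimal sketch of the Language Context Module (Eqs. 6-11); sizes are assumptions.
import torch
import torch.nn as nn

class LanguageContext(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 200, d: int = 512,
                 layers: int = 4, heads: int = 8):
        super().__init__()
        self.w_e = nn.Embedding(num_classes, embed_dim)       # f_se^ent
        self.phi = nn.Linear(4, embed_dim)                    # p_i from [x, y, w, h]
        self.gamma = nn.Linear(embed_dim, d)                  # learnable projection gamma
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=4 * d,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, labels, boxes):
        # labels: [B, n] entity class indices; boxes: [B, n, 4] as (x, y, w, h)
        s = self.gamma(self.w_e(labels) + self.phi(boxes))    # Eq. (7)
        return self.encoder(s)                                # context-aware entity semantics
```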

Fig. 4

Joint probability for the subject "man" and object "surfboard". The ground-truth triplet \(\langle \text {man, of, surfboard}\rangle \) is sampled from the zero-shot split (where triplets exist only in the evaluation set and do not occur in the training set)

3.4 Experience Estimation Module

This module is designed to learn a prior relationship distribution for each subject-object pair, as compensation for the cross-entropy loss. EEM consists of a semantic extractor, an experience estimator, and a distribution label for supervision. Since both entity class and position influence relation judgment, this module utilizes class and position information to learn a precise predicate distribution. First, we embed entity labels i, j and the position embedding \(p_{ij}\) into a high-dimensional representation space; then, the experience estimator predicts the predicate distribution \(d_{ij}\) between subject i and object j:

$$\begin{aligned} p_{ij}&=\phi _{p}[x_{i},y_{i},w_{i},h_{i},x_{j},y_{j},w_{j},h_{j}],\end{aligned}$$
(12)
$$\begin{aligned} {d}_{ij}&=\varphi \left[ \phi _{s}(f_{se}^{sub}(c_{i})) \phi _{o}(f_{se}^{obj}(c_{j})),p_{ij}\right] , \end{aligned}$$
(13)

where \([\cdot , \cdot ]\) refers to the concatenation operation and \(\phi _{p},\phi _{s},\phi _{o},\varphi \) are fully connected layers with ReLU as the activation function. Finally, we merge the relationship distribution \(d_{ij}\) with the prediction from the baseline, which can be described as

$$\begin{aligned} \hat{d}_{i,j} = SGG(\hat{n}_{ij},\hat{e}_{ij})+d_{ij}, \end{aligned}$$
(14)

where \(\hat{d}_{i,j}\) is the updated prediction likelihood, and \(\hat{n}_{ij}\) and \(\hat{e}_{ij}\) are the enhanced entity and relation features obtained by LCM and LAM (Sections 3.3 and 3.2), respectively.
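Below is a minimal sketch of Eqs. (12)-(13). How the projected subject and object embeddings are fused before concatenation with the position embedding is not fully specified in Eq. (13), so an element-wise product is assumed here, and the layer widths only loosely follow the implementation details.

```python
# A minimal sketch of the Experience Estimation Module (Eqs. 12-14); fusion and sizes are assumptions.
import torch
import torch.nn as nn

class ExperienceEstimator(nn.Module):
    def __init__(self, num_rel: int, embed_dim: int = 200, hidden: int = 1024):
        super().__init__()
        self.phi_s = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.phi_o = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.phi_p = nn.Sequential(nn.Linear(8, hidden), nn.ReLU())      # Eq. (12)
        self.varphi = nn.Sequential(nn.Linear(2 * hidden, 4096), nn.ReLU(),
                                    nn.Linear(4096, num_rel))

    def forward(self, sub_emb, obj_emb, boxes_ij):
        # boxes_ij: [B, 8] concatenated (x, y, w, h) of subject and object
        p_ij = self.phi_p(boxes_ij)                                      # Eq. (12)
        fused = self.phi_s(sub_emb) * self.phi_o(obj_emb)                # assumed fusion
        return self.varphi(torch.cat([fused, p_ij], dim=-1))             # Eq. (13): d_ij

# Eq. (14) sketch: final_logits = baseline_logits + estimator(sub_emb, obj_emb, boxes_ij)
```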

Distribution Label Generation: Dataset annotations inherently reflect human commonsense, so we generate distribution labels from the dataset. For subject i and object j, we obtain \(m^{sub}_{i}\) and \(m^{obj}_{j}\) as the "subject-predicate" and "predicate-object" marginal distributions. Since \(m^{sub}_{i}\) and \(m^{obj}_{j}\) are independent distributions, we calculate the joint probability \(p^{joint}_{ij}\) as

$$\begin{aligned} p^{joint}_{ij} =m^{sub}_{i} \odot m^{obj }_{j}, \end{aligned}$$
(15)

where \(\odot \) denotes the element-wise product. Although EEM, like FREQ [17], generates a predicate distribution from statistics, FREQ directly counts the occurrences of triplets \(\langle subject, predicate, object \rangle \). However, some triplets are scarce in the training samples, so it is hard for FREQ to establish an informative distribution prior. In contrast, EEM uses the joint probability as labels to infer the predicate distribution. Figure 4 shows a typical example of the generated joint probability \(p^{joint}_{ij}\) for a triplet that does not occur in the training set. In this circumstance, FREQ cannot work because the sample count is zero, whereas the top 5 highest likelihoods are reasonable and cover many possible scenarios. Meanwhile, the annotated predicate "of" is ambiguous; a better predicate could be one of the top likelihoods from the joint probability.

Table 1 Comparison between baseline models and LANDMARK under three sub-tasks on the VG dataset

Considering that position information makes more accurate predicate prediction possible, we design a fusion function to balance the joint probability and the true predicate label. The distribution label \(l_{ij}\) can be denoted as

$$\begin{aligned} l_{ij} = \mu \times p^{joint}_{ij} + (1-\mu )\times \text {onehot}[r_{ij}], \end{aligned}$$
(16)

where \(\mu \) is a factor regulating the proportion of marginal frequency. \(r_{ij}\) refers to the true predicate label between subject i and object j.

Objective Function: We choose the MSE loss between the distribution label \(l_{ij}\) and the predicted distribution \({d}_{ij}\), which is denoted as

$$\begin{aligned} \text {MSE} = \frac{1}{ K }\sum _{i=1}^{K}({d}_{i}-l_{i})^{2}, \end{aligned}$$
(17)

where K is the number of relation categories in the dataset, and \(d_{i}\), \(l_{i}\) denote the i-th entries of the predicted distribution and the distribution label, respectively.
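The distribution label of Eqs. (15)-(16) and the objective of Eq. (17) can be sketched as follows; `m_sub` and `m_obj` are assumed to be precomputed marginal distributions over predicates, and the renormalization of the joint term is an added assumption for scale consistency.

```python
# A minimal sketch of distribution-label generation and the EEM objective (Eqs. 15-17).
import torch
import torch.nn.functional as F

def distribution_label(m_sub, m_obj, rel_label, num_rel, mu=0.3):
    # Eq. (15): joint probability as element-wise product of the two marginals.
    p_joint = m_sub * m_obj
    # Renormalization is an assumption added here for scale consistency with the one-hot term.
    p_joint = p_joint / p_joint.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    # Eq. (16): mu trades off the joint prior against the one-hot ground-truth predicate.
    onehot = F.one_hot(rel_label, num_classes=num_rel).float()
    return mu * p_joint + (1.0 - mu) * onehot

def eem_loss(d_pred, label):
    # Eq. (17): mean squared error over the predicate categories (averaged over the batch).
    return F.mse_loss(d_pred, label)
```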

3.5 Baseline model

Any off-the-shelf two-stage scene graph generation model can be used as a baseline. It can be either a sequential model or a graph neural network, as long as it uses entity and relation features for prediction.

4 Experiments

In this section, we first introduce the experimental settings. Then, we test our framework's effectiveness on several SGG models and conduct experiments to analyze the compatibility between our method and state-of-the-art unbiased strategies. Finally, detailed analyses and qualitative studies further verify LANDMARK's superiority from different perspectives.

4.1 Experiment settings

Datasets We employ the widely adopted VG150 [44] split of the Visual Genome [43] dataset to train and evaluate our framework. VG150 contains the 150 most frequent object categories and 50 predicate categories in VG. It consists of more than 108k images, with 70% of the images used for training and 30% for testing. Among the training set, 5000 images are held out for validation.

Table 2 Compatibility test of unbiased methods and LANDMARK under three sub-tasks on the VG dataset
Fig. 5

The number of data samples (bars) and Recall@100 improvements (dots) of LANDMARK over BGNN on the PredCls task. Red dots indicate predicates for which BGNN Recall@100 is zero. PCC is the abbreviation of Pearson Correlation Coefficient

Tasks. We consider three conventional sub-tasks of scene graph generation to evaluate our framework. 1) Predicate Classification (PredCls) predicts relationships between each object pair given their ground-truth bounding boxes and classes. 2) Scene Graph Classification (SGCls) predicts the object classes and their relationships given the ground-truth bounding boxes of objects. 3) Scene Graph Detection (SGDet) needs to detect object classes and bounding boxes, then predict their relationships.

Evaluation Metrics We report the widely accepted Recall@k [34] and mean Recall@k [32] to evaluate model performance. Recall@k computes the fraction of times the correct relationship is predicted in the top k predictions for one image, and mean Recall is used to evaluate unbiased performance. However, both metrics are calculated at the image level. To evaluate EEM, we need a metric at the prediction level. Therefore, TOP-N Recall@K is proposed, which allows the top N scored predicates of each prediction as candidates; then, \(N\times K\) candidates are used for calculating Recall, i.e.,

$$\begin{aligned} \text {TOP-N Recall@K} = \frac{\text {correct}(\{\text {candidates}\}_{N\times K})}{N_{gt}}, \end{aligned}$$

where \(N_{gt}\) is the number of ground-truth relations. When N=1, TOP-N Recall@K is equal to Recall@K.
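A sketch of how TOP-N Recall@K could be computed is shown below; the exact triplet-matching protocol (e.g., box IoU constraints in SGDet) is omitted, and the input formats are assumptions.

```python
# A minimal sketch of TOP-N Recall@K for one image; input formats are assumptions.
import numpy as np

def top_n_recall_at_k(pred_pairs, pred_scores, gt_triplets, n=3, k=100):
    # pred_pairs: [M, 2] subject/object indices; pred_scores: [M, num_rel] predicate scores
    # gt_triplets: set of (subject_idx, predicate_idx, object_idx) ground-truth tuples
    order = np.argsort(-pred_scores.max(axis=1))[:k]          # top-K scored pair predictions
    candidates = set()
    for m in order:
        for p in np.argsort(-pred_scores[m])[:n]:             # top-N predicates per pair
            candidates.add((int(pred_pairs[m][0]), int(p), int(pred_pairs[m][1])))
    hits = sum(1 for t in gt_triplets if t in candidates)     # "correct" in the equation above
    return hits / max(len(gt_triplets), 1)                    # divide by N_gt
```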

Table 3 Ablation studies of three modules on BGNN+BLS
Table 4 Comparison of EEM and FREQ on baseline Motifs

Implementation Details We use a pretrained Faster R-CNN [45] with a ResNeXt-101-FPN [46] backbone as the object detector, with an ROIAlign [46] resolution of 7, and freeze its weights during training. We use pretrained GloVe [47] weights to initialize \(w_{s},w_{o},w_{e}\) in the semantic extractor. For EEM, we use 3-layer MLPs \(\phi _{s},\phi _{o},\phi _{p}\) with 1024 neurons, and \(\varphi \) is a 2-layer MLP with hidden dimension 4096. \(\mu \) in Eq. 16 is set to 0.3 for unbiased methods and 0.7 for baseline models. For Eq. 4, we choose two \(3\times 3\) convolution layers to generate a 256-dim attention vector. For LCM, we choose a context encoder with 4 layers and 8 heads, and entity dimension \(d=512\). For training, approximately 10000 iterations are sufficient for each baseline. The base learning rate is 0.01 and the batch size is 16. We use the SGD optimizer.

4.2 Comparison with baseline models

Table 1 shows the mRecall & Recall of 5 baseline models with and without our framework LANDMARK. The baseline models include GCN-based models such as G-RCNN [30] and BGNN [9], and context modeling networks such as IMP [44], Transformer [26], and Motifs [17]. It is worth mentioning that we do not deploy any unbiased strategies on these models. We observe that incorporating our proposed LANDMARK leads to a consistent mRecall improvement in all three tasks for all baseline models, which demonstrates the robustness of our approach. For mR@100, the average improvements are 3.66%, 2.64%, and 1.37% on the three tasks. These improvements can be attributed to the fact that multi-semantic language representations indeed facilitate visual representations. It is not surprising that the improvement progressively shrinks across the three tasks, due to inaccurate classes and positions predicted by the pretrained object detector. Besides, Recall drops to different extents, which is a common characteristic of unbiased methods.

We also measure the total parameters (M) and GFLOPs in Table 1. Adding LANDMARK introduces about 20M additional parameters and about 2.5 GFLOPs of additional computation, relative increases of about 6% and 1.2%, respectively. This is attributed to the lightweight module design and the adoption of low-dimensional inputs (i.e., language rather than image inputs).

4.3 Compatibility with unbiased methods

A tricky problem of unbiased SGG strategies is that most of them have demanding working conditions, for example, regarding the baseline model's performance or the sample distribution. Hence, we test our framework's compatibility with other unbiased methods by stacking two methods together; the mRecall & Recall are listed in Table 2. The listed strategies belong to different types, e.g., data resampling: BLS [9], logit manipulation: TDE [15], and feature refinement: LANDMARK (ours). According to this table, there are several findings:

  • Applying LANDMARK with other methods is effective. For instance, BLS with LANDMARK on BGNN achieves new state-of-the-art performance. There are two reasons: 1) LANDMARK has a distinctive semantic feature enhancement strategy, which does not conflict with other methods. 2) Most unbiased methods are designed to obtain priors from biased prerequisites (e.g., the baseline, the long-tailed dataset), whereas LANDMARK is model-agnostic, that is, the baseline does not affect LANDMARK's inference.

  • Only a tiny improvement, or even a decrease, in mR@K occurs when using BLS and TDE together. For example, Motifs+BLS+TDE results in an obvious decrease in mR@k. This suggests that these methods are sensitive to changes in external circumstances (e.g., sampling distribution, baseline model capability).

  • Our network does not sacrifice much Recall. BLS+LANDMARK on Motifs has a higher Recall than the variant without ours. By contrast, using BLS+TDE remarkably impairs Recall performance. We speculate that existing unbiased methods do not explore real discrepancies between predicates, but only increase the likelihood of tail predicates.

4.4 Predicate analysis

As shown in Fig. 5, we present the number of data samples in the VG dataset and the Recall@100 improvements of LANDMARK. Among the 50 predicates, the baseline outperforms LANDMARK on only 11. Though some predicates show obvious decreases, e.g., "wearing", this is compensated by the increase of "wears". Besides, there are 5 hard-to-predict predicates (i.e., belong to, walking in, mounted on, made of, says) that are recalled by LANDMARK.

To explore the correlation between Recall improvements and the data distribution, the Pearson Correlation Coefficient (PCC) is used. PCC = -0.32 shows a weak negative correlation between dataset bias and LANDMARK's improvements. Note that some unbiased methods overfit to tail classes, so their Recall improvements show a strong negative correlation with the number of samples (e.g., TDE: PCC = -0.56). Therefore, LANDMARK is robust to the data distribution.

Fig. 6

Visualization of LAM. We visualize the union area given the ground-truth subject-object pair and a randomly generated pair (red words), which shows the connection between word pairs and particular visual features

Fig. 7

mR@100 of SGG models with different model sizes (1,3,4,6 layers of LCM) on SGDet task

4.5 Ablation studies

We investigate each LANDMARK component by incrementally adding EEM, LAM, and LCM to BGNN+BLS in Table 3. The results indicate that: 1) each component helps the whole framework, and there is no conflict between them; the improvements show that the three modules extract distinctive semantics from the same label inputs. 2) EEM improves PredCls more than the other tasks, which might be caused by inaccurate position and object label predictions in the last two tasks. 3) LCM and LAM consistently promote performance on each task, because priors from the language context and the correlation of word-visual patterns are relatively robust to misclassified but semantically similar object labels.

4.6 Analysis of experience estimation module

As mentioned before, the Experience Estimation Module independently outputs predicate predictions like the Frequency Baseline (FREQ) [17]. Therefore, we evaluate the TOP-N Recall of EEM and FREQ trained on Motifs in Table 4. Except for Top-5 on the PredCls task, EEM outperforms FREQ in all settings, and the gap enlarges as the task difficulty increases. This is attributed to supervision from the joint probability, which alleviates the deficiency of rare subject-object samples. Besides, the position information introduced in EEM makes it more accurate when inferring relations for the same object pair.

Fig. 8

mR@100 performance of SGG models with different \(\mu \) factor on SGDet task

4.7 Analysis of Language Attention Module

To validate LAM's effectiveness, we visualize heatmaps of the relation representation \(\hat{e}_{ij}\) (Eq. 5) generated by BGNN+BLS+LANDMARK with ground-truth or randomly generated subject-object pairs (red words) in Fig. 6. Intuitively, we notice that the attention area is correlated with the given subject and object. For instance, in the leftmost two images, the attention area transfers from the foot to the middle of the man's body when the object changes from shoe to shirt. When given irrelevant words in the rightmost two images, e.g., clock-snow, this module seems to be interested in the top-left and bottom areas, which suggests LAM can associate words with related visual patterns without being given coordinates.

Fig. 9

Visualization Results: In three columns, we present scene graphs generated from annotations, BGNN with BLS (unbiased model), and LANDMARK, respectively. Relations and entities that occur in neither the annotations nor the baseline predictions are marked in orange and purple, respectively

4.8 Analysis of Language Context Module

We test the mRecall@100 performance with different numbers of LCM transformer layers in Fig. 7. For each model, we record the performance and corresponding parameters of 1-, 3-, 4-, and 6-layer LCMs. The figure shows that mRecall receives a noticeable boost as the number of layers increases from 1 to 4. However, the 6-layer structure does not bring sufficient performance improvement considering the increase in parameters. Therefore, 4 layers are adopted for LCM.

4.9 Evaluation of \(\mu \) factor

Figure 8 shows the mRecall@100 of 4 models on the SGDet task with different factors \(\mu \) in Eq. 16. We find that for baseline models, \(\mu =0.7\) is preferable, while for unbiased methods, 0.3 is better. This indicates that EEM mainly learns diversified predicates.

4.10 Qualitative studies

We visualize scene graph generation results from annotations, BGNN with BLS, and BGNN+BLS+LANDMARK on the PredCls task in Fig. 9. Intuitively, the annotations and BGNN+BLS tend to predict relationships between "less informative pairs" (e.g., person-eye, roof-building, and light-bus). However, LANDMARK can further detect relationships such as man-sidewalk or roof-bus. Besides, LANDMARK focuses on high-level semantic and positional relationships. For instance, "hand hold pizza" and "light on back of bus" show that our framework successfully learns the positional relationships between objects.

5 Conclusion

In this paper, we first point out the inadequate utilization of the language modality in previous SGG methods. Motivated by language's polysemy, we propose a representation enhancement framework (LANDMARK) for the SGG task, featuring multi-semantic extraction from object labels. This plug-in network explores word-vision correlated patterns and the language context from word embeddings, and learns the predicate distribution from subject-object pairs with position information. Compared with other unbiased methods, our framework is a new approach from the representation refinement perspective. Experiments and analyses show consistent improvements on baseline models and great compatibility with other unbiased methods.