
1 Introduction

A scene graph is a structured representation of image content that encodes spatial and semantic information of each object and the relationship between each pair of them. Recently, inferring such a graph has gained more attention since it provides a deep understanding of the scene and improves various vision tasks such as Image Retrieval [8, 9], Image Generation [10, 11], Image/Video Captioning [12, 13], and Visual Question Answering [14, 15].

The major challenge of generating scene graphs is reasoning about relationships. Earlier works [16, 17] aimed to produce a local prediction of object relationships in order to simplify the process of generating visually-grounded scene graphs. The approach was to independently predict relationships between pairs of objects without considering the scene’s context. In contrast, co-reasoning with contextual information could often resolve the ambiguity due to local predictions in isolation [18].

Message passing between individual objects or triplets is valuable for visual relationship detection [18]. Since objects linked by a visual relationship are semantically related, and relationships that partially share objects are also semantically related, message passing between these elements improves the quality of visual relationship detection [2]. However, this mechanism is expensive and requires considerable computation time because of the large number of features to handle [19]. Moreover, the visual appearance of the same relation varies significantly from one scene to another [20], which makes the feature extraction phase more challenging. Thus, many methods focus on semantic features [21] to compensate for the lack of visual features.

To address these challenges and overcome the obstacle of variability in visual appearance, this work proposes a novel message-passing approach based on pairwise semantic spatial relationships. The concept is to replicate the human capacity to predict the relations between objects in a scene using their pairwise semantic spatial relationships.

In this paper, we first review past works related to message-passing scene graph generation and spatial relationships classification. Then, we introduce the proposed method in Sect. 3. In Sect. 4, the experimental results are shown and discussed. Finally, Sect. 5 concludes the paper by summarizing the obtained results.

2 Related Work

To contextualize our approach and evaluate its performance against the existing methods, we review the related work on message-passing scene graph generation and spatial relationships applications.

2.1 Message Passing

There are three levels to understanding and perceiving context [18]. First, the interdependence between the components of a triplet is fundamental: the prediction of one component, such as the subject, predicate, or object, depends on the others. Second, triplets are not isolated; objects with relations are semantically dependent, and relations that partly share object(s) are also semantically linked. Third, visual relationships are specific to the scene, and global-view features help predict relationships. Hence, message passing between objects and triplets plays a significant role in detecting visual relationships.

The literature divides message-passing techniques into two types:

Local Message Passing Within Triplet. Li et al. [22] proposed a phrase-guided visual relationship detection framework that first extracts three feature branches for each triplet proposal (subject, predicate, and object) and then uses a phrase-guided message-passing structure to exchange information between the three branches. Dai et al. [23] proposed an efficient framework known as the Deep Relational Network (DR-Net). Using multiple inference units that capture the statistical relationships between triplet components, the DR-Net produces the posterior probabilities of the subject, object, and relationship. Zoom-Net [2] is another interesting model: it uses a Spatiality-Context-Appearance Module consisting of two spatiality-aware feature alignment cells to pass messages between the different triplet components. This type of message passing ignores the global context, whereas joint reasoning with contextual information can often resolve ambiguities caused by isolated local predictions [18].

Global Message Passing Across All Elements. Li et al. [3] developed the Multi-level Scene Description Network (MSDN), in which message passing is guided by a dynamic graph constructed from object and caption region proposals. F-Net, proposed by Li et al. [24], clusters the fully connected graph into several subgraphs and then uses a Spatial-weight Message Passing structure to pass messages between subgraph and object features. MSDN and F-Net consider a subgraph as a whole when sending and receiving messages. Liao et al. [25] proposed the semantics-guided graph relation neural network (SGRNN), in which the target and the source must be an object or a predicate within a subgraph. When all other objects are considered carriers of global contextual information for each object, they pass messages to each other over a fully connected graph. However, propagating many types of features and performing inference on a densely connected graph is expensive and time-consuming to train [19].

2.2 Spatial Pairwise Relationships

Apprehending the spatial relationships between objects, how they are positioned and related to one another, is imperative for a deep understanding of the scene. Spatial relation detection is useful in visually situated dialog and human-robot interaction, for example when instructing a robot in a household environment to accomplish a specific task [26] or when self-driving cars are designed to provide a textual explanation for their actions [27]. Likewise, the explicit use of spatial prepositions is helpful in automatic image captioning [28].

The proposed approach aims to emulate the human capacity to infer rich information from the spatial relations between the different objects in a scene, in order to detect and infer activities and action relations between image entities.

In this work, we propose a novel approach for scene graph generation based on the global message-passing mechanism. By incorporating semantic spatial relationships, our approach aims to overcome the challenge of variability in visual appearance and make more robust predictions about the relationships between objects in a scene.

3 Proposed Method

The proposed approach for scene graph generation is divided into pairwise spatial relationship classification and scene graph update, as Fig. 1 shows. Our model tackles visually grounded scene graph generation from an image by first generating a graph with a spatial relationship between each object pair. It then recognizes the action relationships and updates the scene graph by applying the message-passing mechanism using only semantic features (object and spatial relation labels). To achieve this, we use two neural network architectures that focus on each task independently and stack them together once they have been trained. To evaluate the approach properly, we use ground-truth objects for object detection and recognition.

Fig. 1. An overview of the pipeline of our image scene graph generation model.

Before delving into the proposed model, we describe the scene graph structure. Formally, a scene graph is a structured representation of a scene’s content. It comprises the objects’ labels with bounding box coordinates and the relationship between each object pair.

A scene graph is defined as a 3-tuple set \(G=\{B,O,R\}\):

\(B=\{b_1,b_2,...,b_n\}\) is the bounding box set, \(b_i \in R^4\) corresponds to the bounding box of the \(i^{th}\) region.

\(O=\{o_1,o_2,...,o_n\}\) is the object label set, where \(o_i\) corresponds to the class label of the region \(b_i\).

\(R=\{r_{1\rightarrow 2},r_{1\rightarrow 3},...,r_{n\rightarrow n-1}\}\) is the relationship triplet set, where \(r_{i\rightarrow j}\) is a triplet composed of the subject \((o_i,b_i)\), the object \((o_j,b_j)\), and the relationship class \(a_{i \rightarrow j}\).
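For concreteness, a scene graph in this form can be held in a simple data structure. The following minimal Python sketch is our own illustration; the class and field names are not from the paper:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneGraph:
    # B: bounding boxes, one (x1, y1, x2, y2) tuple per region
    boxes: List[Tuple[float, float, float, float]]
    # O: object class labels, aligned with `boxes` by index
    labels: List[str]
    # R: relationship triplets (subject index, object index, predicate label)
    relations: List[Tuple[int, int, str]] = field(default_factory=list)

# Example: "man riding horse", "horse on grass"
g = SceneGraph(
    boxes=[(10, 20, 110, 220), (5, 120, 200, 300), (0, 250, 400, 400)],
    labels=["man", "horse", "grass"],
    relations=[(0, 1, "riding"), (1, 2, "on")],
)
```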

3.1 Pairwise Spatial Relationship Classification

Feature Extraction. This module aims to capture the objects’ appearance, semantic cues, and relative spatial locations between pairwise objects. Inspired by [29], we extract three types of features to classify the semantic spatial relationship between each object pair in the scene. Geometric Features: we exploit the spatial contextual information from the subject, object, union, and intersection boxes. For each box \((x_1,y_1,x_2,y_2)\), a 9-dimensional vector is calculated as (1) shows:

$$\begin{aligned} V=\left( \frac{c_x}{W},\frac{c_y}{H},\frac{w}{W},\frac{h}{H},\frac{x_1}{W},\frac{y_1}{H},\frac{x_2}{W},\frac{y_2}{H},\frac{w \cdot h}{W \cdot H}\right) \end{aligned}$$
(1)

where \((c_x,c_y)=(\frac{x_1+x_2}{2},\frac{y_1+y_2}{2})\) is the box’s centroid, \((w,h)=(x_2-x_1,y_2-y_1)\) denotes the width and the height of the box, and \((W,H)\) the width and the height of the image. For an empty intersection box, a zero vector represents the intersection box’s geometric features. Then, all four vectors are concatenated to compose the geometric features. Appearance Features: for the subject bounding box region, the object bounding box region, the union box, and the intersection box, we use the FC7 layer of VGG16 [30] pre-trained on ImageNet [31] to extract a 4096-dimensional appearance feature vector. For an empty intersection box, a zero vector represents the intersection box’s appearance features. We then concatenate all four vectors to compose the appearance features of the spatial relationship. Semantic Features: GloVe [32] is used as a word embedding engine to encode the object label names of the subject and the object; for multi-word names, the mean of the word vectors is calculated. The semantic relation features are composed by concatenating the two encoded name features.

Finally, the relation features are obtained by concatenating geometric, appearance, and semantic features.
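As an illustration of the geometric part of this step, the following sketch computes the 9-dimensional vector of Eq. (1) for each of the four boxes and concatenates them; the function names are hypothetical, and the appearance (VGG16 FC7) and semantic (GloVe) branches are not shown:

```python
import numpy as np

def geometric_vector(box, img_w, img_h):
    """9-d normalized geometry of one box (Eq. 1); box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return np.array([cx / img_w, cy / img_h, w / img_w, h / img_h,
                     x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                     (w * h) / (img_w * img_h)], dtype=np.float32)

def intersection_box(a, b):
    """Intersection of two boxes, or None if it is empty."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def union_box(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def pair_geometric_features(subj, obj, img_w, img_h):
    """Concatenate the 9-d vectors of the subject, object, union, and intersection boxes."""
    inter = intersection_box(subj, obj)
    vecs = [geometric_vector(subj, img_w, img_h),
            geometric_vector(obj, img_w, img_h),
            geometric_vector(union_box(subj, obj), img_w, img_h),
            geometric_vector(inter, img_w, img_h) if inter is not None
            else np.zeros(9, dtype=np.float32)]  # empty intersection -> zero vector
    return np.concatenate(vecs)  # 36-d geometric part of the relation features
```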

Spatial Relationship Classification. After concatenating the extracted features described in Sect. 3.1 for each object pair, we feed them to a multilayer perceptron (MLP) to classify the spatial relationship. The initial scene graph, containing only pairwise spatial relationships, is then generated.
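Purely for illustration, the classifier stage could look like the following PyTorch sketch; the layer sizes, dropout, and hidden dimension are our assumptions rather than the paper’s configuration (the 9 output classes correspond to the SpatialSense predicates used in Sect. 4.1):

```python
import torch.nn as nn

class SpatialRelationMLP(nn.Module):
    """MLP over the concatenated geometric + appearance + semantic relation features."""
    def __init__(self, in_dim, num_predicates=9, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, num_predicates),  # logits over spatial predicates
        )

    def forward(self, relation_features):
        return self.net(relation_features)
```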

3.2 Scene Graph Update

We aim to update the scene graph relationships generated in Sect. 3.1 by applying the message-passing mechanism, so as to obtain more meaningful semantic information in the form of activity and action relationships.

Action Relationship Recognition. This step aims to update the edge representations while keeping the node representations constant, using a variant of GGNN [33] to propagate information among edges. For each edge \(a_{s\rightarrow o}\), three steps are needed, as Fig. 2 shows: pass preparation, information aggregation, and edge update. Pass Preparation: for each of the subject node \((o_s,b_s)\) and the object node \((o_o,b_o)\), its set of neighbors \(\{(o_i,b_i)\}\) is selected. Information Aggregation: for each of the subject node \((o_s,b_s)\) and the object node \((o_o,b_o)\), information is summarized by computing the incoming information from its neighbors, as shown in (2):

$$\begin{aligned} m_k=o_k+\sum _{i} a_{i\rightarrow k}\cdot o_i -\sum _{j} a_{k\rightarrow j}\cdot o_j \end{aligned}$$
(2)

Edge Update: after information aggregation, we concatenate \(m_s\) and \(m_o\). The result is then passed, together with the current state \(a_{s \rightarrow o}\), to a Gated Recurrent Unit (GRU) to update the edge label. Finally, a scene graph that additionally contains pairwise action relationships is obtained.
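The following sketch shows one possible realization of Eq. (2) and the GRU edge update; the shared embedding dimension, the use of GRUCell, and the bookkeeping of incoming/outgoing edges are our assumptions, not the paper’s implementation:

```python
import torch
import torch.nn as nn

def aggregate(node_emb, edge_emb, node_idx, in_edges, out_edges):
    """Eq. (2): m_k = o_k + sum_in a_{i->k} * o_i - sum_out a_{k->j} * o_j.
    in_edges / out_edges: lists of (neighbor_index, edge_index) pairs for node k."""
    m = node_emb[node_idx].clone()
    for i, e in in_edges:
        m = m + edge_emb[e] * node_emb[i]
    for j, e in out_edges:
        m = m - edge_emb[e] * node_emb[j]
    return m

# Hypothetical setup: node and edge embeddings share dimension d.
d = 128
edge_gru = nn.GRUCell(input_size=2 * d, hidden_size=d)

def update_edge(node_emb, edge_emb, s, o, edge_idx, neighbors):
    """Update edge s->o: concatenate m_s and m_o, pass them with the
    current edge state a_{s->o} to the GRU, and return the new state."""
    m_s = aggregate(node_emb, edge_emb, s, *neighbors[s])
    m_o = aggregate(node_emb, edge_emb, o, *neighbors[o])
    message = torch.cat([m_s, m_o], dim=-1).unsqueeze(0)
    new_state = edge_gru(message, edge_emb[edge_idx].unsqueeze(0))
    return new_state.squeeze(0)  # decoded to a predicate label downstream
```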

Fig. 2. Relation update process. After computing the information (Information Aggregation) from the selected neighbors (Pass Preparation), the state of the edge (on) is updated to (riding) by passing both the computed information and the current state (on) to the GRU (Edge Update).

4 Tests and Results

This section presents a detailed evaluation of the proposed model. First, the spatial relationship classifier is evaluated; then, we move on to the scene graph generation model. Tests are conducted on a personal computer with an i7 processor, 16 GB of memory, and a 2 GB Nvidia GPU.

4.1 Pairwise Spatial Relationship Classification

Dataset. We conduct the experiments and evaluate the spatial relationship classifier on the SpatialSense dataset [34], a benchmark collected for spatial relation recognition that contains 17,498 spatial relations on 11,596 images. All images are collected from Flickr and NYU [37]. The annotated spatial relations in the dataset cover 3,679 unique object classes and 9 unique predicates (i.e., above, behind, in, in front of, next to, on, to the left of, to the right of, under). The SpatialSense dataset provides positive and negative examples of spatial relationships; to train the spatial relationship classifier, only positive triplets are considered. Following the official split in [34], we take 65% of the relations for training, 15% for validation, and 20% for testing.

Evaluation Metric. The proposed classifier’s ability to classify pairwise spatial relationships is evaluated using classification accuracy [35], a reliable and fair measure for this task.

Compared with State-of-the-art Methods. We compare our classifier with various recent methods.

Table 1. Classification accuracy comparison on the test split of the SpatialSense dataset (all values expressed as percentages). IFO = in front of, TTLO = to the left of, TTRO = to the right of. Bold font represents the highest accuracy; underline indicates the second highest.
Fig. 3. Classification examples of spatial relationships by the proposed classifier on the SpatialSense dataset: a, b, c, d, and e are correct classifications, whereas f is a misclassification. We believe that with depth information, our classifier could predict the proper label in front of instead of under for the misclassification in f.

Table 1 shows the performance of different approaches on the SpatialSense dataset. ViP-CNN [22], Peyre et al. [36], PPR-FCN [38], DR-Net [23], and VTransE [39], initially designed for visual relationship detection, are based only on visual appearance. Language-only, 2D-only, and Language+2D [34], designed for spatial relation recognition, are based on 2D/language features. Our classifier takes into consideration the three main types of features: appearance, semantic, and geometric. Overall, the accuracy scores indicate that our proposed classifier outperforms almost all existing approaches in terms of overall accuracy, the exception being DSRR [40] (by only 1.1%), which exploits depth information with an additional depth estimation model. With additional depth information, we expect our classifier to gain a further performance boost and correctly classify complex cases that were previously misclassified, as Fig. 3 shows.

4.2 Scene Graph Generation

After training and testing our classifier for spatial relationships between pairs of objects, this sub-section evaluates the whole scene graph generation process.

Dataset. To evaluate the proposed approach, we use VG150 [1], a widely adopted subset of Visual Genome for evaluating scene graph generation tasks. It contains 108,073 images and covers 150 object categories and 50 predicate categories. We follow the same split as [1] to evaluate our approach.

Evaluation Metric. We aim to generate the scene graph of an image. The key points are relationship classification and graph generation; we no longer evaluate the accuracy of object detection or recognition. Since we use both ground-truth boxes and object labels directly, we evaluate the model from the aspect of predicate classification (PredCls). We use R@50 and R@100 to measure performance, where R@K computes the fraction of ground-truth relationships that appear among the top K most confident relation predictions of an image.
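For reference, a simplified R@K computation could look as follows; this sketch assumes scored triplets and ignores the graph-constraint variants used in some evaluation protocols:

```python
def recall_at_k(pred_triplets, gt_triplets, k):
    """pred_triplets: list of (score, subj_idx, pred_label, obj_idx), any order.
    gt_triplets: set of (subj_idx, pred_label, obj_idx).
    Returns the fraction of ground-truth relations found in the top-k predictions."""
    top_k = sorted(pred_triplets, key=lambda t: t[0], reverse=True)[:k]
    hits = {(s, p, o) for _, s, p, o in top_k} & gt_triplets
    return len(hits) / max(len(gt_triplets), 1)

# Per-image recalls are then averaged over the test set to report R@50 / R@100.
```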

Compared with State-of-the-art Methods. We report predicate classification on Visual Genome [1] in Table 2. This experiment is meant to serve as a benchmark against existing message-passing scene graph approaches.

Table 2. Evaluation results of the predicate classification task on the Visual Genome dataset [1].

The experiments demonstrate the effectiveness of our proposed method. We outperform existing models that use Visual Genome supervision for PredCls by 6.06 points in recall@50 and 0.51 points in recall@100. Message Passing [1] and Zoom-Net [2] are local message-passing-based methods, whereas the rest are all global message-passing-based methods.

Visual features for the same relation vary greatly from scene to scene, making relation prediction more challenging, especially for rare and unseen configurations and relations. For example, the visual features that represent the “riding” relation between a person and a horse can be very different from one image to another, depending on the pose, the background, the lighting conditions, etc. In contrast, by considering the semantic pairwise spatial relationships between the objects in the scene, we can infer from “the man on the horse and the horse on the grass” that the action relation between the man and the horse is “riding”. That is why focusing on semantic features such as semantic pairwise spatial relationships can improve the predicate classification task.

5 Conclusion

This paper investigates a novel message-passing scene graph generation approach based on semantic spatial relationships. First, we classify the spatial relationship between each pair of objects in the scene by extracting geometric, appearance, and semantic features and passing them to an MLP architecture. Then, as a second step, we apply the message-passing mechanism to detect action relationships and update the scene graph.

Experimental results demonstrate its efficiency and competitiveness compared to state-of-the-art approaches, with 73.09 for R@50 and 78.1 for R@100. However, there are several prospective paths for improving this approach further. First, incorporating additional depth information into the spatial relationship classifier could improve the accuracy and robustness of the model. Moreover, training the spatial relationship classifier on datasets with other spatial relationship classes, such as between, near, and far, could be useful in scenes with more diverse spatial configurations. Furthermore, extending the proposed method to work with multiple spatial relations instead of a single spatial relation could boost our model, as it would capture more nuanced relationships between objects.

By incorporating these improvements, the proposed method can be enhanced to achieve even better performance.

These prospective paths can be explored in future research and can contribute to advancing the field of scene understanding.