
1 Introduction

Coined by James J. Gibson [3], the term affordances refers to the action possibilities that the environment offers to an agent. Gibson claimed that living beings perceive their environment in terms of such affordances.

Fig. 1. Trained in a one-shot manner, our approach detects human affordances and hallucinates the associated human bodies interacting with the environment in a natural and physically plausible way

An artificial agent with object, semantics, and human affordance detection capabilities would be able to identify elements, their relations, and the locations in the environment that support affordances such as stand-able, walk-able, place-able, and sit-able. This enhanced scene understanding is helpful in the Metaverse, where virtual agents must execute actions or where scenes must be populated by humans performing a given set of interactions.

We present a direct representation of human affordances that extracts a meaningful geometrical description by analysing the proximity zones and clearance space between interacting entities in human-environment configurations. Our approach can determine locations in the environment that support such interactions and generate natural and physically plausible 3D representations of them (see Fig. 1). We compare our method with intensively trained state-of-the-art methods.

2 Related Work

Popular interpretations of the concept of affordances refer to them as action possibilities or opportunities of interaction for an agent/animal that are perceived directly from the shape and form of the environment/object.

Affordance detection from RGB images was explored by Gupta et al. [4] with a voxelised geometric estimator. Later, data-intensive approaches were adopted by Fouhey et al. [2], with a detector trained on labelled RGB frames from the NYUv2 dataset [13], and by Luddecke et al. [9], with a residual neural network trained on the ADE20K dataset [20] using a lookup table between affordances and object parts.

Other approaches go further by synthesising the detected human-environment interaction. The representation of such interactions has been showcased with human skeletons in [7, 8, 15]; nevertheless, their representativeness cannot be reliably evaluated because contacts, collisions, and the naturalness of human poses are not entirely characterised.

Closer to our work, more complex interaction representations over 3D scenes have been explored. Ruiz and Mayol [12] developed a geometric interaction descriptor for non-articulated, rigid object shapes with good generalisation in detecting physically feasible interaction configurations. Using the SMPL-X human body model [10], Zhang et al. [18] developed a context-aware human body generator that learnt the distribution of 3D human poses conditioned on the depth and semantics of the scene from recordings in the PROX dataset [5]. In a follow-up effort, Zhang et al. [17] developed a purely geometric approach to model human-scene interactions by explicitly encoding the proximity between the body and the environment, thus requiring only a mesh as input. More recently, Hassan et al. [6] learnt the distribution of contact zones on human body poses and used it to find environment locations that better support them.

Our main difference from [5, 6, 17, 18] is that ours is not a data-driven approach: it does not require a large labelled dataset, e.g. the roughly 100K image frames in PROX [5]. As in [12], one or a few examples of an interaction suffice to train our detector, but we extend the descriptor to consider the clearance space of the interactions and their uses, and we optimise the pose with the SMPL-X human model after a positive detection.

Fig. 2. Illustrative 2D representation of our training pipeline. (a) Given a posed human body \(M_h\) interacting with an environment \(M_e\) on a reference point \(p_{train}\), (b) we extract the Interaction Bisector Surface from the Voronoi diagram of sample points on \(M_h\) and \(M_e\), then (c) we use the IBS to characterise the proximity zones and the surrounding space with provenance and clearance vectors

3 Method

3.1 A Spatial Descriptor for Spatial Interactions

Inspired by recently developed methods that have revisited geometric features such as the bisector surface for scene-object indexing [19] and affordance detection [12], our affordance descriptor (see Fig. 2) expands on the Interaction Bisector Surface (IBS) [19], an approximation of the well-known Bisector Surface (BS) [11]. Given two surfaces \(S_1, S_2 \subset \mathbb {R}^3\), the BS is the set of centres of spheres that touch both surfaces at exactly one point each.

Our one-shot training process requires 3-tuples (\(M_h\), \(M_e\), \(p_{train}\)), where \(M_h\) is a posed human body mesh, \(M_e\) is an environment mesh and \(p_{train}\) is a reference point on \(M_e\) where the interaction is supported.

Let \(P_h\) and \(P_e\) denote the sampling points on \(M_h\) and \(M_e\), respectively; their IBS \(\mathcal {I}\) is defined as:

$$\begin{aligned} \mathcal {I}&= \big \{p \mid \min _{ p'_h \in P_h} \Vert p-p'_h\Vert =\min _{ p'_e \in P_e} \Vert p-p'_e\Vert \big \} \end{aligned}$$
(1)

We compute the Voronoi diagram \(\mathcal {D}\) generated by \(P_h\) and \(P_e\) to produce \(\mathcal {I}\). By construction, every ridge in \(\mathcal {D}\) is equidistant to the pair of points that defines it; \(\mathcal {I}\) is therefore composed of the ridges in \(\mathcal {D}\) generated by one point from \(P_h\) and one from \(P_e\). An IBS can extend to infinity, but we limit \(\mathcal {I}\) by clipping it with the bounding sphere of \(M_h\) scaled \(ibs_{rf}\) times in its radius. A low sampling rate degenerates into an IBS that pierces the boundaries of \(M_h\) or \(M_e\); a higher density of samples is critical in the zones where the proximity between the interacting meshes is small. We use three stages to populate \(P_h\) and \(P_e\): 1) Initial sampling. We generate Poisson disk sample sets [16] of \(ibs_{ini}\) points on each of \(M_e\) and \(M_h\). 2) Counterpart sampling. We append to \(P_e\) the closest points on \(M_e\) to the elements of \(P_h\) and, symmetrically, add to \(P_h\) the closest points on \(M_h\) to the samples in \(P_e\); we execute this strategy \(ibs_{cs}\) times. 3) Collision point sampling. We calculate a preliminary IBS and test it for collisions with \(M_h\) and \(M_e\); if any exist, we add the collision points and their counterpart points as samples. We repeat the collision point sampling strategy until the IBS pierces neither \(M_h\) nor \(M_e\).
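As an illustration of this construction, the following minimal sketch extracts the mixed ridges of the Voronoi diagram, i.e. those generated by one human and one environment sample. It is written with SciPy under the assumption that \(P_h\) and \(P_e\) are already sampled; the function name and the point-cloud return value (rather than the ridge polygons the method keeps) are our choices, not the authors' implementation:

```python
# Minimal sketch: approximate the IBS as the Voronoi vertices of ridges
# whose two defining input points come from different sample sets.
import numpy as np
from scipy.spatial import Voronoi

def ibs_from_voronoi(P_h: np.ndarray, P_e: np.ndarray) -> np.ndarray:
    points = np.vstack([P_h, P_e])
    n_h = len(P_h)                     # indices < n_h belong to the human samples
    vor = Voronoi(points)
    ibs_vertex_ids = set()
    # vor.ridge_points[k] is the pair of input points whose cells share
    # ridge k; the ridge lies on the IBS iff the pair is mixed.
    for (p1, p2), ridge in zip(vor.ridge_points, vor.ridge_vertices):
        if (p1 < n_h) != (p2 < n_h):
            ibs_vertex_ids.update(v for v in ridge if v != -1)  # -1: vertex at infinity
    return vor.vertices[sorted(ibs_vertex_ids)]
```

In a full pipeline, the result would still be clipped with the augmented bounding sphere of \(M_h\) and refined with the counterpart and collision sampling stages described above.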

To capture the regions of interaction proximity on our enhanced IBS, we use the notion of provenance vectors [12]. The provenance vectors of an interaction start at points on \(\mathcal {I}\) and end at the nearest point on \(M_e\). Formally:

$$\begin{aligned} V_p = \big \{(a, \vec {v}) \mid a \in \mathcal {I},\ \vec {v} = \big ( \underset{e \in M_e}{\arg \min }\ \Vert e - a \Vert \big ) - a \big \} \end{aligned}$$
(2)

where a is the starting point of the delta vector \(\vec {v}\) to the nearest point on \(M_e\). Provenance vectors inform about the direction and distance of the interaction; the smaller the vector, the more noteworthy it is for the description of the interaction. Let \(V'_p \subset V_p\) be the subset of provenance vectors that end at a point in \(P_e\); we perform a weighted randomised selection of elements from \(V'_p\) with the weights allocated as follows:

$$\begin{aligned} w_i = 1-\frac{|\vec {v}_i|- |\vec {v}_{min}|}{|\vec {v}_{max}|- |\vec {v}_{min}|},\ i=1,\ 2,\ \dots ,\ |P_e| \end{aligned}$$
(3)

where \(|\vec {v}_{max}|\) and \(|\vec {v}_{min}|\) are the norms of the largest and smallest vectors in \(V'_p\), respectively. The selected provenance vectors \(\mathcal {V}_{train}\) are integrated into our affordance descriptor after normalising their positions relative to the defined reference point \(p_{train}\):

$$\begin{aligned} \mathcal {V}_{train} = \big \{(a'_i, \vec {v}_i) \mid a'_i = a_i - p_{train}, \ i= 1,\ 2,\ \dots ,\ num_{pv} \big \} \end{aligned}$$
(4)

where \(num_{pv}\) is the number of samples from \(V'_p\) to integrate into our descriptor.
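For illustration, Eqs. 3-4 amount to the weighted selection below, sketched with NumPy; the function name and the rng argument are ours, and the degenerate case where all norms are equal is not handled:

```python
# Sketch of the weighted randomised selection of provenance vectors (Eqs. 3-4).
import numpy as np

def select_provenance_vectors(origins, vectors, p_train, num_pv, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    norms = np.linalg.norm(vectors, axis=1)
    # Eq. 3: shorter vectors (tighter proximity) receive larger weights.
    w = 1.0 - (norms - norms.min()) / (norms.max() - norms.min())
    idx = rng.choice(len(vectors), size=num_pv, replace=False, p=w / w.sum())
    # Eq. 4: express the selected starting points relative to p_train.
    return origins[idx] - p_train, vectors[idx]
```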

However, the provenance vectors are, on their own, insufficient to capture the whole nature of the interaction for highly articulated objects such as the human body. We expand this concept with a more comprehensive description that includes a set of vectors defining the surrounding space necessary for the interaction. Given an evenly sampled set \(S_H\) of \(num_{cv}\) points on \(M_h\), the clearance vectors \(\mathcal {C}_{train}\) that integrate into our descriptor are defined as follows:

$$\begin{aligned} \mathcal {C}_{train} = \big \{(s'_j, \vec {c}_j) \mid s'_j=s_j - p_{train},\ s_j \in S_H,\ \vec {c}_j = \psi (s_j,\ \hat{n}_j,\ \mathcal {I}) \big \} \end{aligned}$$
(5)
$$\begin{aligned} \psi (s_j, \hat{n}_j, \mathcal {I}) = {\left\{ \begin{array}{ll} d_{max} \cdot \hat{n}_j &{} \text {if } \varphi (s_j,\ \hat{n}_j,\ \mathcal {I})>d_{max}\\ \varphi (s_j,\ \hat{n}_j,\ \mathcal {I}) \cdot \hat{n}_j &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

where \(p_{train}\) is the defined reference point, \(\hat{n}_j\) is the unit surface normal vector at sample \(s_j\), \(d_{max}\) is the maximum norm of any \(\vec {c}_j\), and \(\varphi (s_j,\ \hat{n}_j,\ \mathcal {I})\) is the distance travelled by a ray with origin \(s_j\) and direction \(\hat{n}_j\) until collision with \(\mathcal {I}\).
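A sketch of Eqs. 5-6, assuming a triangulated IBS and trimesh-style ray casting; the function name is ours, and rays that never hit \(\mathcal {I}\) are treated as exceeding \(d_{max}\):

```python
# Sketch of clearance-vector computation (Eqs. 5-6), not the authors' code.
import numpy as np
import trimesh

def clearance_vectors(ibs_mesh, samples, normals, p_train, d_max):
    hits, ray_ids, _ = ibs_mesh.ray.intersects_location(
        ray_origins=samples, ray_directions=normals, multiple_hits=False)
    dist = np.full(len(samples), np.inf)             # phi(s_j, n_j, I)
    dist[ray_ids] = np.linalg.norm(hits - samples[ray_ids], axis=1)
    c = normals * np.minimum(dist, d_max)[:, None]   # Eq. 6: clip at d_max
    return samples - p_train, c                      # Eq. 5: normalised origins
```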

Formally, our affordance descriptor is defined as:

$$\begin{aligned} f:(M_h, M_e, p_{train}) \longrightarrow (\mathcal {V}_{train},\mathcal {C}_{train}, \hat{n}_{train}) \end{aligned}$$
(7)

where \(\hat{n}_{train}\) is the unit surface normal vector of \(M_e\) at \(p_{train}\).

Fig. 3. We determine the supportability of an interaction at a given point by (a) measuring the compatibility of the surface normal as well as of the provenance and clearance vectors over different rotated configurations. (b) After a positive detection, the body pose is optimised to generate a natural and physically plausible interaction

3.2 Human Affordances Detection

Let \(\mathcal {A}=(\mathcal {V}_{train}, \mathcal {C}_{train}, \hat{n}_{train})\) be an affordance descriptor; we define its rigid transformations as:

$$\begin{aligned} \varOmega (\mathcal {A}, \phi , \tau )&= (\mathcal {V}^{A}_{\phi \tau },\ \mathcal {C}^{A}_{\phi \tau },\ \hat{n}_{train}) \\ \mathcal {V}^{A}_{\phi \tau }&= \big \{ (a''_i, \vec {v}_i)\ |\ a''_i=R_{\phi }\cdot a'_i+\tau \ ,\ (a'_i,\ \vec {v}_i) \in \mathcal {V}_{train} \big \} \nonumber \\ \mathcal {C}^{A}_{\phi \tau }&= \big \{(s''_i, \vec {c}_i)\ |\ s''_i=R_{\phi }\cdot s'_i+\tau \ ,\ (s'_i,\ \vec {c}_i) \in \mathcal {C}_{train} \big \} \nonumber \end{aligned}$$
(8)

where \(\tau \in \mathbb {R}^3\) is a translation vector and \(\phi \) is the rotation angle around the z axis that defines the rotation matrix \(R_\phi \).
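A small sketch of \(\varOmega \) follows. Note that Eq. 8 as written transforms only the starting points; rotating the vector components as well, so that ray directions follow the rotated descriptor, is our assumption and is exposed as a flag:

```python
# Sketch of the rigid transformation in Eq. 8.
import numpy as np

def transform_descriptor(origins, vectors, phi, tau, rotate_vectors=True):
    c, s = np.cos(phi), np.sin(phi)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])       # R_phi: rotation around the z axis
    new_origins = origins @ R.T + tau     # a''_i = R_phi * a'_i + tau
    new_vectors = vectors @ R.T if rotate_vectors else vectors  # our assumption
    return new_origins, new_vectors
```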

We determine that a test location \(p_{test}\) on an environment \(M_{test}\), with unit surface normal vector \(\hat{n}_{test}\), supports a trained interaction \(\mathcal {A}\) if the angle between \(\hat{n}_{train}\) and \(\hat{n}_{test}\) is less than a threshold \(\rho _{\vec {n}}\) and, for at least one of the \(n_{\phi }\) rotation values \(\phi \) used during testing, the descriptor translated to \(p_{test}\) achieves a good alignment of its provenance vectors and a gated number of clearance vectors colliding with \(M_{test}\).

After corroborating the match between train and test normal vectors, we transform the interaction descriptor \(\mathcal {A}\) with \(\tau =p_{test}\) and \(n_{\phi }\) different \(\phi \) values within \([0,2\pi ]\). For each calculated 3-tuple \((\mathcal {V}^{A}_{\phi \tau },\ \mathcal {C}^{A}_{\phi \tau },\ \hat{n}_{train})\), we generate a set of rays \(R_{pv}\) defined as follows:

$$\begin{aligned} R_{pv}=\Big \{(a''_i, \hat{\nu }_i)\ |\ \hat{\nu }_i = \frac{\vec {v}_i}{ \Vert \vec {v}_i\Vert },\ (a''_i, \vec {v}_i) \in \mathcal {V}^{A}_{\phi \tau } \Big \} \end{aligned}$$
(9)

where \(a''_i\) is the starting point, and \(\hat{\nu }_i \in \mathbb {R}^3\) is the direction of each ray. Then we extend each ray in \(R_{pv}\) by \(\epsilon ^{pv}_i\) until collision with \(M_{test}\) as

$$\begin{aligned} (a''_i+\epsilon ^{pv}_i \cdot \hat{\nu }_i) \in M_{test}, \ \ i=1,2,\dots ,num_{pv} \end{aligned}$$
(10)

and compare it with the magnitude of each corresponding provenance vector in \(\mathcal {V}^{A}_{\phi \tau }\). When a ray in \(R_{pv}\) extends beyond a predetermined limit \(max_{long}\) without reaching the environment, it is classified as non-colliding. We calculate the alignment score \(\kappa \) as the sum of the differences between the extended rays and the trained provenance vectors:

$$\begin{aligned} \kappa = \sum _{i\,:\,\epsilon ^{pv}_i \le max_{long}} \big |\epsilon ^{pv}_i- \Vert \vec {v}_i \Vert \big | \end{aligned}$$
(11)

The higher the \(\kappa \) value, the lower the supportability of the interaction at \(p_{test}\). We experimentally determine interaction-wise thresholds for the sum of differences, \(max_\kappa \), and for the number of missing ray collisions, \(max_{missings}\), which allow us to score the affordance capabilities at \(p_{test}\).
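Putting Eqs. 9-11 together, the alignment test can be sketched as follows (trimesh-style ray casting; the function name and the boolean return are ours):

```python
# Sketch of the provenance-vector alignment test (Eqs. 9-11).
import numpy as np
import trimesh

def supports_interaction(test_mesh, origins, vectors,
                         max_long, max_kappa, max_missings):
    dirs = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)   # Eq. 9
    hits, ray_ids, _ = test_mesh.ray.intersects_location(
        ray_origins=origins, ray_directions=dirs, multiple_hits=False)
    eps = np.full(len(origins), np.inf)             # extensions epsilon^{pv}_i
    eps[ray_ids] = np.linalg.norm(hits - origins[ray_ids], axis=1)
    colliding = eps <= max_long                     # rays classified as colliding
    kappa = np.abs(eps[colliding]
                   - np.linalg.norm(vectors[colliding], axis=1)).sum()  # Eq. 11
    return kappa <= max_kappa and (~colliding).sum() <= max_missings
```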

Clearance vectors are meant to quickly detect colliding configurations through ray-mesh intersections. Similarly to provenance vectors, we generate a set of rays \(R_{cv}\) with origins and directions determined by \(\mathcal {C}^{A}_{\phi \tau }\). We extend the rays in \(R_{cv}\) until they collide with the environment and record each extension \(\epsilon ^{cv}_j\). Extended rays with \(\epsilon ^{cv}_j \le \Vert \vec {c}_j\Vert \) are considered possible collisions. In practice, we also track an interaction-wise threshold \(max_{collisions}\) to refuse supportability due to collisions. A sparse distribution of clearance vectors on noisy meshes can leave collisions undetected; to mitigate this, we pad the scene with a set of spherical fillers (see Fig. 3).
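The clearance gate itself reduces to comparing the ray extensions against the stored clearance norms, as in this fragment (the helper name is ours and the extensions \(\epsilon ^{cv}_j\) are assumed precomputed as above):

```python
# Sketch of the clearance collision gate: an extension shorter than the
# trained clearance norm flags a possible collision with the environment.
import numpy as np

def clearance_collision_count(eps_cv: np.ndarray, clearance: np.ndarray) -> int:
    return int((eps_cv <= np.linalg.norm(clearance, axis=1)).sum())
```

Supportability is refused when this count exceeds \(max_{collisions}\).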

Every human-environment interaction trained from the PROX dataset [5] has an associated SMPL-X characterisation that we use to optimise the human pose with previously determined body contact regions, the AdvOptim loss function presented in [17] and the SDF values of the scene.

Fig. 4. Action planning as a further step. Left: three affordances evaluated in an environment. Right: scores are used to plan concatenated action milestones

4 Experiments

We evaluate the physical plausibility and perceived naturalness of the generated human-environment interactions. Our baselines are the approaches presented in PLACE [17] and POSA [6].

PROX [5] is a dataset with 12 scanned indoor environments and 20 recordings of subjects interacting within them. We divide PROX into train and test sets following the setup in [17]. To generate our descriptors, we use data from 23 manually selected frames with subjects sitting, standing, reaching, lying, and walking. We also test on 7 rooms from MP3D [1] and 5 rooms from Replica [14].

We generate the IBS \(\mathcal {I}\) with an initial sampling set of \(ibs_{ini}=400\) points on each surface, with the counterpart sampling strategy executed \(ibs_{cs}=4\) times and a cropping factor of \(ibs_{rf}=1.2\). Our descriptors are made up of \(num_{pv}=512\) provenance vectors and \(num_{cv}=256\) clearance vectors extended up to \(d_{max}=5\,\text {cm}\). In testing, we use a normal angle difference threshold of \(\rho _{\vec {n}} = \pi /3\), check for supportability in \(n_{\phi }=8\) different directions, and extend provenance vectors up to \(max_{long}=1.2\) times the radius of the sphere used for cropping \(\mathcal {I}\) during training.
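For reference, these hyperparameters can be gathered in a single configuration object; this is a convenience sketch (the dataclass and field names are ours), with \(d_{max}\) expressed in metres under the assumption of metric mesh units:

```python
# Convenience sketch grouping the paper's hyperparameters.
from dataclasses import dataclass
import numpy as np

@dataclass
class AffordanceConfig:
    ibs_ini: int = 400            # initial Poisson disk samples per mesh
    ibs_cs: int = 4               # counterpart sampling iterations
    ibs_rf: float = 1.2           # bounding-sphere scale for cropping the IBS
    num_pv: int = 512             # provenance vectors in the descriptor
    num_cv: int = 256             # clearance vectors in the descriptor
    d_max: float = 0.05           # clearance clipping distance, 5 cm
    rho_n: float = np.pi / 3      # normal angle threshold at test time
    n_phi: int = 8                # rotations around z tested per location
    max_long_factor: float = 1.2  # provenance ray limit vs. cropping radius
```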

Physical Plausibility Test. We use the non-collision and contact scores as in [17], but include an additional cost metric that indicates the collision depth between the generated body and the scene. We generate 1300 interacting bodies per model in each scene and report the averages of the scores in Table 1. In all datasets, bodies generated with our optimised model present high non-collision as well as low contact and collision-depth scores.

Perception of Naturalness Test. Every scene in our datasets is used equally in the random selection of 162 test locations. We use the optimised version of each model to generate human-environment interactions at the test locations and evaluate their perceived naturalness on Amazon Mechanical Turk. Each MTurk worker performs 11 randomly selected assessments, including two control questions, by observing interactions with dynamic views; three different workers evaluate every item. In a side-by-side evaluation, we simultaneously present outputs from two different models. Answers to “Which example is more natural?” show that our human-environment configurations are preferred in 60.7% and 72.6% of the comparisons with PLACE and POSA, respectively. In an individual evaluation, where every generated interaction is assessed with the statement “The human is interacting very naturally with the scene. What is your opinion?” on a 5-point Likert scale (from 1 for “strongly disagree” to 5 for “strongly agree”), the means and standard deviations of the evaluations are: PLACE 3.23 ± 1.35, POSA 2.79 ± 1.18, and ours 3.39 ± 1.25.

Table 1. Physical plausibility. Non-collision, contact, and collision-depth scores (\(^\uparrow \): benefit, \(^\downarrow \): cost) before and after optimisation. Best results in boldface

5 Conclusion

Our approach generalises well to detect interactions and generate natural and physically plausible body-scene configurations. Understanding a scene in terms of action possibilities is a desirable capability for autonomous agents performing in the Metaverse (see Fig. 4).