1 Introduction

As robots evolve to co-exist in human-centered environments, they will be expected to understand as well as refine their knowledge about various tasks of day-to-day life. Further, such understanding should enable them to perform those tasks in heterogeneous situations without the need to provide learning data for each and every situation.

From the perspective of social learning, which in a loose sense is “A observes B and then ‘acts’ like B”, three components have been identified in [11]: goal, action and result. Based on what is learned, there are basically three categories: mimicking, emulation and imitation. Mimicking is merely reproducing the action without any goal. Emulation is regarded as learning the causal properties of objects [61] and learning the goal by observation [45, 64], in order to bring about the same result, possibly by different means/actions than the demonstrated ones. Imitation [40, 54] is bringing about the same result with the same actions. Therefore, emulation involves reproducing the changes in the state of the environment that result from the demonstrator’s behavior, whereas imitation involves reproducing the actions that produced those changes.

Emulation is regarded as an important social learning skill, also among great apes [61] and children [29, 45]. In fact, it facilitates performing the same task in different ways. Some components and evidence of successful emulation during interaction can be traced even to a child’s early developmental phase. Children are able to show an object to someone in different ways [37]: by pointing, by turning the object, or by holding it so that the other can see it. Similarly, they are able to hide an object from another person in different ways [25]: by placing a screen between the person and the object, or by placing the object itself behind the screen from the person’s perspective. All this suggests that we are able to abstract the ‘desired effect’ to bring about, and to do so by reasoning from the other person’s perspective. In the discussion above, such desired effects were: the object should be visible to the other person for the task of showing; the object should be invisible from the other person’s perspective for the task of hiding. Therefore, to successfully achieve such tasks as a sign of emulation, i.e. bringing about the same result, possibly by different means/actions than the demonstrated ones, understanding the desired effect and reasoning about perspective are essential.

Motivated by this evidence, we also separate the emulation and imitation aspects of learning by demonstration for robots. In this paper, we develop a framework for human-level understanding of the effect-based semantics of a task, independent of its execution. Such ‘meaningful’ understanding also provides the flexibility to plan a task in alternative ways depending upon the situation, in addition to enriching natural human–robot interaction.

We hypothesize a task tk as a series of actions A by a set of agents Ag, causing a change in the world state from \(WS(t_i)\) to \(WS(t_f)\), similar to [42]. Further, we assume that various facts are inferred continuously during a course of action, and \(IF(t)\) represents the facts inferred at time stamp \(t\). Hence, by observing and analyzing an instance of \(\langle WI, A, WF \rangle\) (see Fig. 1), various parameters of a task such as preference, desired and undesired changes, trajectory, etc. could be learned, where \(WI=(WS(t_{i}),IF(t_{i}))\), \(WF=(WS(t_f),IF(t_f))\), and \(t_i\) and \(t_f\) are the time stamps corresponding to the start and end of A. Here it is important to note that WI and WF are snapshots at particular instants; however, as the inference of facts is continuous, both dynamic and static aspects of the environment can be asserted, for example “box on table”, “ball moving”, etc. Further, depending upon the level of abstraction, A could be symbolically described as a single action or a series of actions.
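As a concrete illustration, a minimal Python sketch of how such an observed instance could be stored is given below; the class and field names (WorldSnapshot, Demonstration, etc.) are illustrative assumptions rather than the actual implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class WorldSnapshot:
    """A snapshot (WS(t), IF(t)) at a given time stamp t."""
    timestamp: float
    geometric_state: Dict[str, Any]   # WS(t): positions/configurations of objects and agents
    inferred_facts: Dict[Tuple, Any] = field(default_factory=dict)  # IF(t), e.g. ("on", "box", "table"): True

@dataclass
class Demonstration:
    """One observed instance of <WI, A, WF> for a task."""
    task_name: str
    actions: List[str]   # A: one action or a series of actions, depending on the abstraction level
    wi: WorldSnapshot    # WI = (WS(t_i), IF(t_i))
    wf: WorldSnapshot    # WF = (WS(t_f), IF(t_f))
```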

Fig. 1
figure 1

\(\left\langle WI, A, WF \right\rangle \) triplet, showing the causal nature of environment change: a set of actions A on the initial world WI at time \(t_i\) results in a final world WF at time \(t_f\)

Table 1 gives a general idea of the possible components that could be inferred and learned by partially or fully observing different components of the \(\langle WI, A, WF \rangle\) tuple. In this paper our focus is at the \(\langle WI, WF \rangle\) level, marked with * in Table 1, by reasoning about the effect and assuming positive examples.

Table 1 Observation and learning components correlation

Human-Centered Object Manipulation Tasks (HCOM Tasks) We define the tasks within the context of this paper as human-centered object manipulation tasks. These are tasks in which one agent (robot or human) performs some object manipulation task for another person, by considering his/her presence and reasoning about him/her. Hence, such object manipulation tasks require human-centered reasoning, beyond object-centered reasoning about the stability of grasps, placements, etc.

The focus of this paper is the subset of such human-centered object manipulation tasks that require pick, place, and hold like actions. Therefore, next we will first present related work on learning pick-and-place type tasks, which addresses one or another aspect of Table 1, followed by related work on effect-based learning approaches. Then we will make our scope precise and discuss our motivation for having hierarchies of facts, which enable comparing as well as qualifying the effects of a task to be learned. Following this, the instantiation of the hierarchical knowledgebase for our domain will be presented. Then we will present the learning framework, followed by experimental results and analyses for different tasks. The conclusion will be preceded by a discussion of the potential applications and benefits of such human-level understanding of tasks.

2 Related Work

Different learning aspects of Table 1 have been addressed by researchers; see [4] for a survey. At the trajectory level, in [28] the robot learns to pick-and-place with constraints on orientations. In [44], the task of pouring demonstrated by a human performer is adapted by the robot to maintain collision-free movement. Similarly, other works adapt the learned trajectory for modified scenarios [7, 21]. In [66], the trajectory is learned from the perspective of maintaining critical aspects of the motion. Such approaches are in fact complementary to learning the symbolic description of the task, i.e. what the task means, which facilitates deciding how (above the trajectory level) to perform the task in different ways in various situations.

At the symbolic primitives level, the task is mainly learned in two forms: (1) based on the sequence of sub-actions and (2) based on the effect in terms of changes in the environment. In sub-action learning approaches, the task of ‘place an object next to another object’ would be inferred as ‘reach’, ‘grasp’ and ‘transfer_relative’ [14]. The task of ‘Take a bottle out of the fridge’ would be sub-symbolized as ‘Open the fridge’, ‘Grasp the bottle’, ‘Get the bottle out’, ‘Close the fridge’ and ‘Put the bottle on the table in a stable position’ [20]. In [52], incremental learning of the task precedence graph, for the tasks of pouring from a bottle and laying a table, has been presented. In [33], the robot grounds the table assembling task in terms of ‘reach’, ‘pick’, ‘place’ and ‘withdraw’, and tries to learn the dependencies in order to reorder and adapt for different initial setups. In [46], a hybrid approach represents the entire task in a symbolic sub-task manner but also incorporates trajectory information to perform the task. In [38], learning to assemble an electric switch is presented, where an example of the plan is provided by moving the manipulator through the desired assembly; the system observes the assembly and infers the underlying plan. In [16], the authors abstract the human demonstration of assembly construction to a sequence of object connections, which are used to infer a motion grammar for the robot to repeat the task. In [53], a probabilistic approach to learning relational planning rules has been presented, based on state representation by facts like on, clear, etc., and actions like paint, pickup, drive, walk, etc., covering different domains.

However, most of these approaches actually reason on actions, i.e. they try to represent a task as sub-tasks/sub-actions from the point of view of execution, which in fact facilitates the imitation aspect of learning, as discussed earlier. There is no explicit reasoning about the semantics of the task independent of the execution. As mentioned earlier, our focus will be on task understanding from the effect point of view, to facilitate emulation learning. In fact, recognizing the effect of actions, based on the initial and resulting world states, is an important component of causal learnability, and a complementary aspect to reasoning at the action level, i.e. how to generate that effect [42].

For successful emulation, the robot should be able to analyze the effects in terms of the task-driven changes. In [9], through dialogue, the task ‘to follow’ a person is understood as remaining within 1 meter of the person. From the perspective of learning object manipulation tasks by observing human demonstrations, in [22] the effect of pick-and-place type tasks has been analyzed by using predicates such as holding object, hand empty, object at location, etc. In [43] the robot performs different actions such as grasp, touch and tap on different objects to analyze the effects; once learned, these could be used to select the appropriate action for achieving a particular effect [39]. However, the effects of each action on the object were described in terms of velocity, contact and object-hand distance. In [60], KnowRob, a first-order knowledge representation and processing system, represents knowledge in an action-centric way and learns action models for a real-world pick-and-place type task domain, coupled with objects and their properties. In [58] an approach has been presented to learn abstract-level action selection from observation by considering the position, orientation, and the symbolic interpretations of the performer’s body movement, such as bow and pick object. In [13], the robot grounds the goal of the observed tasks in symbolic concepts like ‘on the shelf’, ‘left side of the table’, etc. for pick-and-place type tasks on a table top. In [31], planning models for dexterous tasks, such as pushing a slider, are learned based on automatically generated contact constraints, relaxed automatically whenever necessary, addressing the correspondence problem arising from the structural differences between the robot and the human.

However, as the focus of these works is not on tasks that one agent performs for another agent, they do not exploit or address the necessary aspect of reasoning based on perspective taking, abilities and effort of the target-agent (the agent for whom the task is being performed).

In this paper we focus on such human-centered object manipulation tasks, which make it necessary to reason about the effects from the perspective of changes in the target-agent’s abilities and effort, hence requiring a learning framework based on combined reasoning about such aspects.

3 Scope, Motivation and Contributions

We will consider a set of basic yet key human-centered object-manipulation tasks in a typical interaction scenario, such as give, make accessible, show, hide, put away and hide away an object. One common effect of such tasks is to enable and/or disable actions or abilities of the target-agent (the agent for whom the task is being performed). For example, make accessible enables the target-agent to take the object whenever he/she wants, while hide deprives the target-agent of the ability to see the object. Hence, reasoning about the effect of a task from the target-agent’s perspective is essential for understanding such tasks.

3.1 An Illustrative Example

Let us consider an example scenario of making an object accessible to a person, i.e. an object that is currently invisible and/or unreachable for a person should be made visible and reachable to him/her. In Fig. 2 person P1 has to make the green bottle (marked by a red arrow) accessible to person P2. The task is the same; however, depending upon the current mental and physical states and desires of both of them and their relationship, P1 could prefer to perform the task by choosing to put the bottle at different places, as in (b), (c) and (d). Here, the interesting point is that P1 is able to infer from P2’s perspective that if P2 stands up, leans forward and stretches out her arm, she can get the bottle in (b), whereas in (c), P2 is only required to stretch out her arm. In (d), as an attempt to balance mutual efforts, P1 leans forward and puts the bottle at a place that requires P2 also to lean and stretch out her arm to take it. This suggests that the robot should also be able to perform perspective taking not only from the current state of the agent but also from different states the agent might attain. Hence, this shows the necessity of combined reasoning based on effort, ability and perspective taking.

Fig. 2
figure 2

a Initial scenario for the task of making the green bottle (indicated by arrow) accessible to P2 by P1. P1 puts the bottle so that it will be visible and reachable by P2 if she b stands up, leans forward and stretches out her arm, c just stretches out her arm, d leans forward and stretches out her arm from the sitting position. (Color figure online)

Now, assume that the robot observes the task as performed in Fig. 2c and learns just by reasoning about the actions, in terms of symbolic sub-tasks such as grasp object, carry object and put object at distance ‘x’ from person P2, or put the object so that it is reachable from P2’s current position. In this case, it will not be able to identify that the tasks performed in Fig. 2b, d are the same task. This is because of two main reasons: (1) what the robot has learned is how to perform the task; (2) it did not reason at the correct level of abstraction required for such tasks. In this example a better understanding of the task would be: the object should become ‘easier’ to be seen, reached and grasped by the target-agent, P2. Hence, the robot should be able to infer facts at different levels of abstraction, which might not be directly observable, such as the comparative facts easier, difficult, etc., and use them in the learning process. This points towards building a hierarchy of knowledge based on different levels of abstraction.

3.2 Main Contributions

In [42] two desirable capabilities for autonomous causal learnability have been discussed: (a) the ability to infer indirect facts, which could be obtained through ramifications of the action’s effects; (b) the ability to build a hypothesis that the agent can use to predict the effect-based resultant world state from a novel initial state that has not been observed before.

The main contributions of the paper are to deal with the above-mentioned two components by incorporating the aspects explored earlier in this section in the following manner:

  1. (i)

    Hierarchical knowledge building incorporating effort, ability and perspective taking: Enriching the robot’s knowledge with a set of hierarchies of facts related to the agents’ capabilities and the object state. We enable the robot to infer comparative facts such as easier, difficult, maintained, reduced, etc., as well as qualitative facts such as supportive, non-supportive, etc. This requires what we call multi-state visuo-spatial perspective taking about the agents, i.e. combining the aspects of effort, ability and perspective taking. To the best of our knowledge, facts based on such reasoning have not been inferred and used in the context of a robot understanding tasks from demonstrations.

  2. (ii)

    Human-level task semantics understanding through explanation based learning: We present an explanation based learning (EBL) framework to learn effect-based task semantics by building a hypothesis tree. Further, we have incorporated m-estimate based reasoning to find, based on consistency, the relevant predicates for a task, which also incorporates the notion of experience. The framework autonomously learns at the appropriate levels of abstraction for different tasks. We argue that such human-level understanding successfully holds for novel scenarios and facilitates the transfer of understanding among heterogeneous robots.

Positive Demonstrations by Expert Teacher Learning based on expert demonstrations is an accepted practice for explanation based learning and goal driven autonomy [63, 65]. Therefore, we also assume that the demonstrations are given by expert teachers; we are not trying to teach something to a “child” robot with a non-expert teacher or wrong demonstrations.

4 Effect-Centric Knowledge Abstractions Building: A General Guideline

As discussed, to capture the ‘meaning’ of the aforementioned tasks, it is important to reason about the capabilities and constraints of the agents involved. Further, to facilitate learning at the appropriate abstraction level, there should be different levels of abstraction based on facts that might not be directly observable from sensor data, for example the fact that something became easier. Hence, in this section and the next, we present the first contribution of the paper: hierarchical knowledge building, enabling the robot to infer facts at different levels of abstraction.

The hierarchical knowledge is built from the perspective of analyzing effects, hence it requires comparative reasoning about the initial and final states corresponding to the task. We consider two types of effects to be incorporated in the hierarchy: (1) Ability based effects: building relations based on changes in the applicability of the agent’s abilities. (2) Perceptual situation based effects: building relations based on changes in perceived information about the status and the situation of the world. Our approach to developing the hierarchy is also influenced by the intuitive understanding of the tasks and their semantic differences. For example, if the task is to show something, we know that the object’s visibility should increase for the target-agent (the agent to whom the object is being shown). Hence, there should not only be an attribute measuring visibility, but also an attribute capturing the notion of change in visibility.

Fig. 3
figure 3

A general representation of fact hierarchy from the perspective of analyzing effect

Based on such reasoning, we propose different levels of abstraction in the knowledgebase, see Fig. 3 (we assume level 1 is the highest level of abstraction):

  1. (i)

    Level 4, geometric reasoning: Agents’ configurations and the geometric world state. At this level different types of geometric reasoning can be performed to extract the values of some geometric attributes. Those can be the geometric position of the object, reasoning about IK based actions for agents, etc.

  2. (ii)

    Level 3, comparable attributes: Quantitative and symbolic reasoning based attributes, which can be directly used to compare two values. Based on the lower level such facts could be the visibility score of an object, the effort of an agent, the state of the object, etc.

  3. (iii)

    Level 2, comparative effect: This corresponds to relative comparisons, based on the initial and final world states, of the way the value of an attribute changes or is maintained. The comparable facts at level 3 are used to derive such comparative effects. In general, the comparison result could be stated in different forms, such as unchanged, easier, difficult, increased, decreased, maintained, gained, lost, etc. Therefore, the choice of the range of values of such facts depends upon the attribute itself, the context and the requirement.

  4. (iv)

    Level 1, qualitative nature: This corresponds to symbolic facts that qualify the nature of the comparative effect and the resulting changes in the value of its child attribute. Such a level of abstraction is more relevant if the comparative effects at level 2 can have more than two types and some of them can be grouped together in a meaningful manner. For example, if the comparative effect for visibility can have three values, increased, maintained and decreased, then we can group and qualify the first two values as of a supporting nature, which supports the agent’s ability to see, whereas the decreased value can be seen as of an unsupportive nature for the agent’s ability to see.

Depending upon the target domain of the task and the type of the attribute, we can decide which levels of abstraction to include when building the corresponding knowledge hierarchy. For instance, if there is an obvious intention behind some changes in the effect, and at least two such changes can be assigned a single meaning, then we can have a fact that qualifies the nature of that intention. For example, if a task maintains or better facilitates a particular ability of an agent, we can say that there is an intention to support that ability, so we will have the abstraction up to level 1. On the other hand, if the attribute at level 2 can take only two possible values, there might be no reason to further qualify those values. For example, if something can be either changed or maintained, there is no need to abstract them further at a higher level; any other label would merely serve as a synonym.
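As an illustration of this guideline, the following Python sketch (with hypothetical names) shows the recurring pattern: a level-2 comparative effect is derived from two level-3 comparable values, and a level-1 qualitative nature is added only when grouping the comparative values is meaningful:

```python
from typing import Any, Callable, Optional

def comparative_effect(v_initial: Any, v_final: Any,
                       compare: Callable[[Any, Any], str]) -> str:
    """Level 2: derive a comparative effect (e.g. 'increased', 'maintained',
    'decreased') from two level-3 comparable values."""
    return compare(v_initial, v_final)

def qualitative_nature(effect: str, supportive_values: set,
                       grouping_enabled: bool = True) -> Optional[str]:
    """Level 1: group comparative effects into a qualitative nature, only when such
    grouping is meaningful (more than two effect values, some sharing one intention)."""
    if not grouping_enabled:
        return None  # e.g. a binary changed/maintained attribute needs no extra level
    return "Supportive" if effect in supportive_values else "Not_Supportive"

# Example for a visibility-like attribute:
effect = comparative_effect(0.001, 0.003,
                            lambda a, b: "increased" if b > a else
                                         "decreased" if b < a else "maintained")
nature = qualitative_nature(effect, {"increased", "maintained"})
print(effect, nature)  # increased Supportive
```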

Depending upon the context and the implementation, the selection of different attributes, different ability types of the agent to consider as well as the computation of the effect can vary. Below, we describe the attributes and methods we have chosen to instantiate the hierarchical knowledgebase adapted to our domain of human-centered object manipulation (HCOM) tasks.

5 Instantiation of the Hierarchical Knowledgebase for Basic Human-Centered Object Manipulation Tasks Domain

In this section, we identify different basic attributes to instantiate the hierarchical knowledgebase for the current context of human-level understanding of the basic HCOM tasks. As discussed in Sect. 3, one contribution of the paper is to identify the importance of combined reasoning about abilities, effort and perspective taking. In this direction, based on such combined reasoning, we have identified some important predicates. Further, we also show in Sect. 8.3 that without them an intuitive understanding of a task cannot be achieved.

The guidelines for the different levels of abstraction outlined in Fig. 3 will be used to instantiate different hierarchies of attributes related to the human and the object. In the trees of the hierarchical representation, the superscripts 1 and 2 represent the facts related to the initial world state WI and the final world state WF, respectively.

We explain in the experimental platform section (Sect. 7.1) that the robot maintains and updates a 3D world model to represent the geometric world state, similar to the one shown in Fig. 4a. By reasoning on this 3D model of the world, the robot infers various facts related to agent abilities, object state and affordances. Next, we will first discuss the various abilities of the agents that our robot is able to infer, and then instantiate a subset of the hierarchical knowledgebase based on combined reasoning about abilities, effort and perspective taking. Then we will discuss agent- and object-status based hierarchies of facts and eventually build our hierarchical knowledgebase of the domain, within the scope of the paper.

Fig. 4
figure 4

a, b: First row, c, d: Second row. a Robot observing a human–human interaction. b P1’s current-state visual perspective; the visibility score (ViS) is 0.0 for the entirely hidden toy dog. c ViS is 0.001 when the toy dog is partially occluded and relatively far. d ViS is 0.003 for the non-occluded and relatively closer toy dog

5.1 Abilities, Effort and Perspective Taking Based Knowledge Building

By reasoning on the agents’ models, the robot estimates various abilities of the agents. An agent’s state, S, is defined by his/her/its configuration, position and orientation. By reasoning on the agent’s state and the environment, the various abilities to see, reach and grasp are inferred.

5.1.1 Ability to See (Se) and the Associated Visibility Score (ViS)

An object is said to be seen if at least one cell belonging to the object is visible to the agent from a particular state. Further, the robot keeps track of how much of the object is visible. For this, the robot calculates the visibility score, ViS, by dividing the number of pixels belonging to the object in the image of the agent’s field of view by the total number of pixels in that image, for a particular state of the agent. Figure 4b–d shows different visibility scores for the toy dog from human P1’s perspective from his current state.
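A minimal sketch of this computation, assuming the agent’s field of view has already been rendered into a per-pixel object-label image (the rendering itself is not shown):

```python
import numpy as np

def visibility_score(label_image: np.ndarray, object_id: int) -> float:
    """ViS = (# pixels belonging to the object in the agent's field-of-view image)
             / (total # pixels in that image)."""
    total_pixels = label_image.size
    object_pixels = int(np.count_nonzero(label_image == object_id))
    return object_pixels / total_pixels

def is_seen(label_image: np.ndarray, object_id: int) -> bool:
    """The object is 'seen' if at least one of its cells/pixels is visible."""
    return visibility_score(label_image, object_id) > 0.0
```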

5.1.2 Ability to Reach (Re)

An object is said to be reachable if at least one cell (in the grid-based representation of the workspace, see [47] for implementation details) belonging to the object is within fingertip length from the shoulder for a particular state of the agent; that is also how we humans roughly perceive reachability [10].

5.1.3 Ability to Grasp (Gr)

As an object might be reachable to an agent for various purposes, such as to touch, push, point or grasp, the robot further distinguishes whether the reachable object is graspable or not. Our robot can generate a set of grasps for different multi-fingered hands/grippers, for objects of different shapes [57]. Figure 5 shows a subset of the generated grasps for different objects for the robot’s gripper and for an anthropomorphic hand used to reason about graspability by the human. If there exists at least one collision-free grasp for the reachable object, the object is assumed to be graspable by the agent’s hand.

Fig. 5
figure 5

Subsets of grasps for an anthropomorphic hand and the robot’s gripper, generated for different objects (see [57])

5.1.4 Multi-State Perspective Taking

Perspective taking has already been shown as an important aspect in learning [6]. In fact, it has been shown as a key component in shaping and grounding our day-to-day interactions. Perspective taking is important for how we interact with others [24], to reduce ambiguity and for grounding [55, 62], for action recognition [30], for proactive behavior [50], planning interactive and cooperative tasks [34], sharing attention [41] and so on.

In [47], taking inspiration from studies in neuroscience and behavioral psychology, we presented the concept of Mightability, which stands for Might be Able to..., based on reasoning about multi-state visuo-spatial perspective taking. The idea is to analyze various abilities, such as the ability to reach, the ability to see, etc., of an agent not only from the current state of the agent, but also from a set of states that the agent might achieve from his/her/its current state. For this, the robot applies \(A_V\), an ordered list of virtual actions, to make the agent virtually attain a state, and then estimates the abilities \(Ab \in \{ See , Reach , Grasp \}\), while respecting the environmental and postural constraints of the agent. Currently,

$$\begin{aligned} A_{V}\subseteq \left\{ A_{V}^{ head }, A_{V}^{ arm }, A_{V}^{ torso }, A_{V}^{ posture }, A_{V}^{ displace } \right\} \end{aligned}$$
(1)

where,

$$\begin{aligned} A_{V}^{ head }\subseteq \left\{ Pan \_ Head , Tilt \_ Head \right\} \end{aligned}$$
(2)
$$\begin{aligned} A_{V}^{ arm }\subseteq \left\{ Stretch \_ Out \_ Arm \left( left|right \right) \right\} \end{aligned}$$
(3)
$$\begin{aligned} A_{V}^{ torso }\subseteq \left\{ Turn \_ Torso , Lean \_ Torso \right\} \end{aligned}$$
(4)
$$\begin{aligned} A_{V}^{ posture }\subseteq \left\{ Make \_ Standing , Make \_ Sitting \right\} \end{aligned}$$
(5)
$$\begin{aligned} A_{V}^{ displace }\subseteq \left\{ Move \_ To \right\} \end{aligned}$$
(6)

This enables the robot to answer the following queries:

$$\begin{aligned}&?\left\{ obj \right\} \Rightarrow \left\{ Ab= True | False , apply (A_{v}), by(Ag) \right\} \end{aligned}$$
(7)
$$\begin{aligned}&?\left\{ A_{v} \right\} \Rightarrow \left\{ Ab= True | False , for(obj), by(Ag) \right\} \end{aligned}$$
(8)

For example, using Eq. 7 the robot can find the objects that will be visible (Ab: see = True) if the human (Ag = human) stands up and leans forward, i.e. \(A_{v}\) = [Make_Standing, Lean_Torso]. Using Eq. 8, the robot can find the ordered list of actions \(A_{v}\) required by an agent to see, reach or grasp a particular object.
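Both query types can be served from a table of precomputed Mightability entries, as in the sketch below; the table layout and the names used here are illustrative assumptions, not the actual interface:

```python
from typing import Dict, List, Optional, Tuple

# (agent, ability, object) -> virtual-action lists that make the ability True,
# assumed to be precomputed by the multi-state perspective-taking module.
Mightability = Dict[Tuple[str, str, str], List[List[str]]]

def objects_for_actions(table: Mightability, agent: str, ability: str,
                        actions: List[str]) -> List[str]:
    """Eq. (7): which objects satisfy `ability` for `agent` after applying `actions`?"""
    return [obj for (ag, ab, obj), action_lists in table.items()
            if ag == agent and ab == ability and actions in action_lists]

def least_effort_actions(table: Mightability, agent: str, ability: str,
                         obj: str) -> Optional[List[str]]:
    """Eq. (8): the virtual-action list for the ability, assuming the precomputed
    lists are stored in increasing order of effort."""
    candidates = table.get((agent, ability, obj), [])
    return candidates[0] if candidates else None

table: Mightability = {
    ("human", "see", "bottle"): [["No_Action_Required"]],
    ("human", "reach", "bottle"): [["Lean_Torso", "Stretch_Out_Arm"]],
}
print(objects_for_actions(table, "human", "reach", ["Lean_Torso", "Stretch_Out_Arm"]))  # ['bottle']
print(least_effort_actions(table, "human", "see", "bottle"))  # ['No_Action_Required']
```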

5.1.5 Effort Based Comparable Facts: Types of Efforts for Agent’s State Change

The robot needs to quantify the effort associated with the actions needed to attain one state of the agent from another. For this, the robot associates a type of effort, in terms of the involved body parts, with the virtual action \(A_{v}\). Figure 6 shows these types, which are in fact motivated by studies of human movement and behavioral psychology [15, 27], where different types of human reach actions have been identified and analyzed. Figure 7 shows the taxonomy of such reaches, involving simple arm-shoulder extension (arm-and-shoulder reach), leaning forward (arm-and-torso reach) and standing reach.

Fig. 6
figure 6

Effort types for visuo-spatial abilities

Fig. 7
figure 7

Effort based taxonomy of reach action

We define a mapping operator \(E_t\) that assigns an effort type \(T_E\) (from the table of Fig. 6) to a virtual action \(A_{v}\) applied from the state S (currently represented by configuration, position and orientation) of the agent, as:

$$\begin{aligned} E_{t}(S,A_{v})\rightarrow T_{E} \end{aligned}$$
(9)

In the current implementation, Eq. 8 always returns the actions requiring the least effort. This could be achieved in different ways: (a) by using approximate but online estimation of all the abilities, applying all the subsets of \(A_{V}\) from Eq. 1 as presented in [47], and then finding the least-effort state among them; if the ability is already satisfied from the current state, then \(A_{v}\) is set to No_Action_Required; (b) by using an IK (inverse kinematics) solver iteratively, activating at each iteration only the set of joints in a predefined order from lowest to highest effort. In the current implementation we use approach (a), as it is online and its accuracy is acceptable for inferring the symbolic effort-related facts at the level of abstraction of interest in this paper.

As mentioned earlier, to the best of our knowledge, such a concept of multi-state visuo-spatial perspective taking, i.e. combining effort, ability and perspective taking, has not been exploited in task understanding from demonstration.

5.1.6 Effort Based Comparative Facts: Relative Effort Class

The robot should be able to analyze two efforts relative to each other. For this we define an operator that compares two effort levels and assigns a class \(C_{RE}\), as:

$$\begin{aligned} C_{RE}\left( T_{E}^{1}, T_{E}^{2} \right) = \left\{ \begin{matrix} { Remains }\_{ Same } &{} if\ T_{E}^{1}= T_{E}^{2}\\ { Becomes }\_{ Easier } &{} if\ T_{E}^{1}< T_{E}^{2}\\ { Becomes }\_{ Difficult } &{} if\ T_{E}^{1}> T_{E}^{2} \end{matrix}\right. \end{aligned}$$
(10)

Note that \(C_{RE}\left( T_{E}^{1}, T_{E}^{2} \right) \ne C_{RE}\left( T_{E}^{2}, T_{E}^{1} \right) \)

Although not used in the current implementation of learning, we also have a measure of the amount of effort for a particular effort level, in terms of how much the agent has to turn/lean, etc. Hence, the robot could further compare two efforts of the same effort level. This could be further enhanced based on studies of musculoskeletal kinematics and dynamics models [32, 56]. Whether the input is the effort level or the amount of effort, the robot extracts the comparative facts of Eq. 10.

5.1.7 Effort Based Qualitative Facts: Nature of Relative Effort Class

We have further enhanced the robot’s knowledgebase with another layer of abstraction by qualifying the relative effort classes (\(C_{RE}\)) as supportive or not supportive. This is based on the intuitive reasoning that if an object becomes more difficult for a person to reach, the intention/nature behind the change is not to support the person’s ability to reach the object. Hence, we qualify the intention behind the change in effort level for an ability type Ab by assigning a nature, \(N_{REC}^{Ab}\), as:

$$\begin{aligned}&N_{REC}^{Ab}\left( C_{RE}^{Ab} \right) =\nonumber \\&\left\{ \begin{matrix} S: Supportive &{} if\ C_{RE}^{Ab}\in \left\{ \begin{matrix} Remains \_ Same , \\ Becomes \_ Easier \end{matrix} \right\} \\ NS: Not\_ Supportive &{} if\ C_{RE}^{Ab}\in \left\{ Becomes \_ Difficult \right\} \end{matrix}\right. \nonumber \\ \end{aligned}$$
(11)

Figure 8 shows the hierarchy of facts based on efforts.
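A direct transcription of Eqs. (10) and (11) into a small Python sketch. The encoding of effort types as comparable integer levels is an assumption: the direction of that ordering has to follow the convention of the table in Fig. 6, and the comparison below follows Eq. (10) as written.

```python
def relative_effort_class(t_e_1: int, t_e_2: int) -> str:
    """C_RE (Eq. 10): compare two effort-type levels (indices into the table of
    Fig. 6, whose ordering convention is assumed here)."""
    if t_e_1 == t_e_2:
        return "Remains_Same"
    return "Becomes_Easier" if t_e_1 < t_e_2 else "Becomes_Difficult"

def nature_of_relative_effort_class(c_re: str) -> str:
    """N_REC (Eq. 11): qualify the relative effort class for an ability type."""
    return ("S:Supportive" if c_re in {"Remains_Same", "Becomes_Easier"}
            else "NS:Not_Supportive")
```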

Fig. 8
figure 8

Effort based Hierarchy of facts

5.1.8 Quantitative Visibility Based Hierarchy of Facts

The robot compares two visibility scores, \(ViS^{1}\) and \(ViS^{2}\), to obtain the relative visibility score class as:

$$\begin{aligned}&C_{RViS}\left( ViS^{1}, ViS^{2}\right) \nonumber \\&\quad = \left\{ \begin{matrix} Almost \_ Same &{}\quad if (ViS^{1} - ViS^{2} \approx 0 )\\ Increased &{} if \ ViS^{1} << ViS^{2}\\ Decreased &{} if \ ViS^{1} >> ViS^{2} \end{matrix}\right. \end{aligned}$$
(12)

Again, we qualify the nature, \(N_{RViSC}\), of this relative class based on whether the quantitative visibility of the object is supported or not:

$$\begin{aligned}&N_{RViSC}\left( C_{RViS} \right) =\nonumber \\&\left\{ \begin{matrix} S: Supportive &{} if \ C_{RViS}\in \left\{ \begin{matrix} Almost \_ Same , \\ Increased \end{matrix} \right\} \\ NS:Not\_ Supportive &{} if \ C_{RViS}\in \left\{ Decreased \right\} \end{matrix}\right. \nonumber \\ \end{aligned}$$
(13)

Figure 9 shows the hierarchy of facts obtained by analyzing the visibility scores.
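The visibility branch (Eqs. 12, 13) follows the same comparative/qualitative pattern; the sketch below uses an explicit tolerance for the “approximately equal” case, whose value is purely an assumption:

```python
def relative_visibility_class(vis_1: float, vis_2: float, tol: float = 1e-4) -> str:
    """C_RViS (Eq. 12): compare visibility scores from WI and WF; `tol` implements
    the 'difference approximately zero' case and is an assumed value."""
    if abs(vis_1 - vis_2) <= tol:
        return "Almost_Same"
    return "Increased" if vis_1 < vis_2 else "Decreased"

def nature_of_relative_visibility(c_rvis: str) -> str:
    """N_RViSC (Eq. 13)."""
    return "Supportive" if c_rvis in {"Almost_Same", "Increased"} else "Not_Supportive"

# Example with the scores of Fig. 4c, d (0.001 before, 0.003 after):
print(relative_visibility_class(0.001, 0.003))  # Increased
```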

Fig. 9
figure 9

Visibility scores based hierarchy of facts

5.2 Agent Status Based Hierarchy of Facts

5.2.1 Agent Posture Based

The robot tracks the human’s body and distinguishes online between standing and sitting postures, based on the relative positions and orientations of the body parts. The agent’s posture predicate Post is:

$$\begin{aligned} Post \ \in \left\{ Standing , Sitting \right\} \end{aligned}$$
(14)

Further, by comparing two postures, a class is assigned as:

$$\begin{aligned}&C_{RPost}\left( Post ^{1}, Post ^{2} \right) \nonumber \\&\quad = \left\{ \begin{matrix} M: Maintained \ \ if\ Post ^{1} = Post ^{2}\\ C: Changed \ \ otherwise\\ \end{matrix}\right. \end{aligned}$$
(15)

5.2.2 Human Hand Status Based

From the human’s perspective a status is assigned to his/her hand as:

$$\begin{aligned}&H_{S}\in \left\{ Holding \_ Object :OH, Free \_of\_ object :OF, \right. \nonumber \\&\quad \left. Resting \_on\_ Support :RS \right\} \end{aligned}$$
(16)

The robot further compares two instances of human hand status from the point of view of manipulability of the object. Based on the reasoning that if the object is in the hand, then the human can directly manipulate it, a comparative class is assigned as follows (Manip stands for Manipulability, see expression 16 for other abbreviations):

$$\begin{aligned}&C_{RHS}\left( H_{S}^{1}\rightarrow H_{S}^{2} \right) \nonumber \\&\quad =\left\{ \begin{matrix} M: Manip \_ Maintained &{} if\ H_{S}^{1}= H_{S}^{2}\wedge H_{S}^{2}= OH \\ G: Manip \_ Gained &{} if\ H_{S}^{1}\ne H_{S}^{2}\wedge H_{S}^{2}= OH\\ L: Manip \_ Lost &{} if\ H_{S}^{1}\ne H_{S}^{2}\wedge H_{S}^{1}= OH\\ V: Manip \_ Avoided &{} if\ H_{S}^{1}\ne OH\wedge H_{S}^{2}\ne OH \end{matrix}\right. \nonumber \\ \end{aligned}$$
(17)

Further, a qualifying nature is assigned from the agent’s perspective to a relative class, \(c=C_{RHS}\left( H_{S}^{1}\rightarrow H_{S}^{2} \right) \), as (see expression 17 for abbreviations):

$$\begin{aligned}&N_{RHSC}\left( c \right) \nonumber \\&\quad = \left\{ \begin{matrix} MD: Manip \_ Desired &{} if \ c\ \in \left\{ M,G \right\} \\ MND: Manip \_Not\_ Desired &{} if \ c\ \in \left\{ L,V \right\} \end{matrix}\right. \end{aligned}$$
(18)

This again results in a hierarchy of facts based on the human’s hand status. Note that in the current implementation, if the state of either hand changes, it is treated as a change in manipulability.

Since we chose to assign only two possible values to this attribute, we do not qualify them further by adding another level.
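A sketch of the hand-status comparison (Eqs. 17, 18), transcribed directly; the string constants simply mirror the abbreviations defined in expression (16):

```python
OH, OF, RS = "Holding_Object", "Free_of_object", "Resting_on_Support"

def relative_hand_status_class(h_1: str, h_2: str) -> str:
    """C_RHS (Eq. 17): compare hand status in WI and WF w.r.t. object manipulability."""
    if h_1 == h_2 == OH:
        return "Manip_Maintained"
    if h_1 != h_2 and h_2 == OH:
        return "Manip_Gained"
    if h_1 != h_2 and h_1 == OH:
        return "Manip_Lost"
    return "Manip_Avoided"   # neither state is Holding_Object

def nature_of_relative_hand_status(c_rhs: str) -> str:
    """N_RHSC (Eq. 18)."""
    return ("Manip_Desired" if c_rhs in {"Manip_Maintained", "Manip_Gained"}
            else "Manip_Not_Desired")
```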

5.3 Object Status Based Hierarchy of Facts

5.3.1 Object Placement Status Based

Based on the relative positions of an object with respect to the human’s hand and other objects, a symbolic placement status is assigned to the object. Currently, the object placement status predicate can have the following values:

$$\begin{aligned} O_{s}\in \left\{ Inside \_ Container , On\_ Support , In\_ Hand , In\_ Air \right\} \nonumber \\ \end{aligned}$$
(19)

Any ambiguity in the object placement status is resolved by simple case-based rules. For example, if the hand is in contact with an object but the object is also on a support, the object status is returned as On_Support.

By comparing two ordered instances of \(O_{s}\) a class is assigned as:

$$\begin{aligned}&C_{ROS}\left( O_{S}^{1}\rightarrow O_{S}^{2} \right) \nonumber \\&\quad =\left\{ \begin{matrix} M: Maintaining \left( O_{S}^{1} \right) &{} if\ O_{S}^{1} = O_{S}^{2}\\ G: Gaining \left( O_{S}^{2} \right) \wedge L: Losing \left( O_{S}^{1} \right) &{} otherwise \end{matrix}\right. \end{aligned}$$
(20)

Note that the second case results in two simultaneous facts to encode the transition: a state gained and a state lost by the object. For example, for the lift task, if initially the object was on a support and now it is in the hand, then expression (20) results in two facts, Losing the On_Support state and Gaining the In_Hand state, to encode the transition.

Further, we qualify the nature of the change, \(c=C_{ROS}\left( O_{S}^{1}\rightarrow O_{S}^{2} \right) \), as supportive to a state \(O_{S}^{'}\) if the transition maintains or gains that state, as:

$$\begin{aligned}&N_{ROSC}\left( c \right) =\nonumber \\&\left\{ \begin{matrix} S: Supportive \left( O_{S}^{'} \right) &{} if\ c\in \left\{ M \left( O_{S}^{'} \right) , G\left( O_{S}^{'}\right) \right\} \\ NS:Not\_ Supportive &{} \quad if\ c\in \left\{ L \left( O_{S}^{'} \right) \right\} \end{matrix}\right. \nonumber \\ \end{aligned}$$
(21)

Hence, a hierarchy of facts based on object placement states is built, as shown in Fig. 10.
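A sketch of the object-placement comparison (Eqs. 20, 21), in particular the dual Losing/Gaining facts used to encode a transition:

```python
from typing import List

def relative_object_status_class(o_1: str, o_2: str) -> List[str]:
    """C_ROS (Eq. 20): one 'Maintaining' fact, or a pair of 'Losing'/'Gaining'
    facts that together encode the transition."""
    if o_1 == o_2:
        return [f"Maintaining({o_1})"]
    return [f"Losing({o_1})", f"Gaining({o_2})"]

def nature_of_object_status_change(facts: List[str], target_state: str) -> str:
    """N_ROSC (Eq. 21): supportive to `target_state` if it is maintained or gained."""
    supportive = {f"Maintaining({target_state})", f"Gaining({target_state})"}
    return "Supportive" if any(f in supportive for f in facts) else "Not_Supportive"

# Lift example: On_Support -> In_Hand
facts = relative_object_status_class("On_Support", "In_Hand")
print(facts)                                                # ['Losing(On_Support)', 'Gaining(In_Hand)']
print(nature_of_object_status_change(facts, "In_Hand"))     # Supportive
print(nature_of_object_status_change(facts, "On_Support"))  # Not_Supportive
```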

Fig. 10
figure 10

Object state based hierarchy of facts

5.3.2 Object Motion Status Based

As already illustrated in Fig. 1, the environment observation and inference is continuous in time. Hence, based on temporal reasoning about the object’s position, at any point of time the motion status of the object is known as:

$$\begin{aligned} O_{ms}\in \left\{ Moving :Mv, Static :St \right\} \end{aligned}$$
(22)

Further, by comparing two instances of motion status, a relative status class for the object’s motion state transition is assigned as follows [see expression (22) for abbreviations]:

$$\begin{aligned}&C_{ROMS}\left( O_{ms}^{1}\rightarrow O_{ms}^{2} \right) =\nonumber \\&\left\{ \begin{matrix} motion \_ gained &{} if\ O_{ms}^{1}= St\wedge O_{ms}^{2}= Mv\\ motion \_ lost &{} if\ O_{ms}^{1}= Mv\wedge O_{ms}^{2}= St\\ motion \_ maintained &{} if\ O_{ms}^{1}= Mv\wedge O_{ms}^{2}= O_{ms}^{1}\\ motion \_ avoided &{} if\ O_{ms}^{1}= St\wedge O_{ms}^{2}= O_{ms}^{1} \end{matrix}\right. \end{aligned}$$
(23)

As pointed out at the beginning of the section, one contribution of this paper is to identify the key aspects of reasoning about ability, effort and perspective taking based predicates (which we consider most relevant for the tasks within the scope of the paper). However, the knowledgebase could be further enriched with other types of facts, e.g. predicates capturing effects such as putting the object on top of a particular object or at a particular place.

In this section we have enriched the robot’s knowledgebase with a set of hierarchies of facts related to the human and the object, which are inferred by the robot online. The next section describes our generalized task-understanding framework based on explanation-based learning and m-estimate based refinement. The framework takes such hierarchies of facts into account and autonomously learns the tasks’ semantics at the appropriate levels of abstraction.

6 Explanation Based Task Understanding

In addition to having a human-level understanding of the task and making such understanding independent of how the task has been executed, another motivation behind the current work is to enable the robot to begin learning even from a single positive demonstration and to refine with subsequent demonstrations. Therefore, we have adapted the framework of explanation based learning (EBL) (see the survey [65]), which has been shown to possess the desired characteristics and can be used for concept refinement (i.e. specialization) as well as concept generalization [19]. For continuity, we list below the components of a typical EBL system (see [19] for details):

Goal Concept: A definition of the concept to be learned, given in terms of high-level properties that are not directly available in the representation of an example.

Training Example: A lower level representation of the examples.

Domain Theory: A set of inference rules and facts sufficient for proving that a training example meets the high-level definition of the concept.

Operationality Criterion: Defines the form in which the learned concept definition must be expressed.

Generally, the domain theory and operationality criterion are devised to restrict the allowable learned vocabulary and the initial hypothesis space, to ensure that the new concept is ‘meaningful’ to the problem solver (the task planner).

Our approach is similar to EBL [23, 65] in the following manner: it (1) constructs an explanation tree for each example of a task, (2) compares these trees to find the largest common subtree, and (3) forms a Horn clause using the leaf nodes of the largest common subtree to obtain the general rule.
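The sketch below illustrates these three steps in a highly simplified form, where each demonstration is reduced to its leaf predicate-value assignments (the names are illustrative); the actual framework refines the retained predicates with the m-estimate based reasoning of Sect. 6.3 rather than by hard intersection:

```python
from typing import Dict, List

Explanation = Dict[str, str]   # leaf predicate name -> observed value for one demonstration

def common_explanation(explanations: List[Explanation]) -> Explanation:
    """Steps (2)-(3): keep only the predicate-value pairs shared by all
    demonstrations (the leaves of the largest common subtree)."""
    common = dict(explanations[0])
    for expl in explanations[1:]:
        common = {p: v for p, v in common.items() if expl.get(p) == v}
    return common

def as_horn_clause(task: str, common: Explanation) -> str:
    """Render the retained leaves as the body of a Horn clause for the task."""
    body = " AND ".join(f"{p}={v}" for p, v in sorted(common.items()))
    return f"Task({task}) <- {body}"

demos = [
    {"N_REC_reach": "Supportive", "C_RPost": "Maintained", "N_RViSC": "Supportive"},
    {"N_REC_reach": "Supportive", "C_RPost": "Changed",    "N_RViSC": "Supportive"},
]
print(as_horn_clause("make_accessible", common_explanation(demos)))
# Task(make_accessible) <- N_REC_reach=Supportive AND N_RViSC=Supportive
```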

However, to address the issue of scalability, our approach differs from EBL in that, instead of providing a specific domain theory and operationality criterion for each target concept to narrow the hypothesis space, we provide a general goal concept in terms of the effect of the task. This creates a hypothesis space spanning the robot’s wide-ranging knowledgebase, from the lowest to the highest levels of abstraction. This ensures that any task that could possibly incorporate any of the effect-related predicates known to the robot can be learned. Then, based on the demonstrations, the robot has to autonomously refine/prune the hypothesis space. This avoids providing a separate domain theory for each and every task the robot will encounter, and enables the robot to autonomously extract the relevant features for a particular task.

We are following the framework of explanation based learning, which requires providing a goal concept and a domain theory (a set of inference rules and facts sufficient for proving that a training example meets the high-level definition of the concept). That is why we have to provide a hypothesis space in terms of the knowledgebase, which encodes the domain theory through various inference rules. In future work, we will address the generation of the hypothesis space based on different knowledge gathering approaches, such as [36, 60].

6.1 General Target Goal Concept to Learn

We provide for any task tk, performed by a performing-agent \(P_{ag}\) for a target-agent \(T_{ag}\) on a target-object \(T_{obj}\), the generalized goal concept to learn as:

$$\begin{aligned} Task\left( tk \right) \leftarrow effect \left( WI,WF \right) \end{aligned}$$
(24)

As illustrated in Fig. 1, WI and WF are snapshots of the continuously inferred facts and the continuously observed world states at time stamps \(t_{i}\) and \(t_{f}\), marking the start and end of a demonstration.

6.2 Provided Domain Theory

Further, the following domain theory is provided:

$$\begin{aligned} effect \left( WI,WF \right) \leftarrow N_{REC}^{reach}\left( T_{ag},T_{obj} \right) \wedge \nonumber \\ N_{REC}^{grasp}\left( T_{ag},T_{obj} \right) \wedge N_{REC}^{see}\left( T_{ag},T_{obj} \right) \nonumber \\ \wedge N_{RViS}\left( T_{obj},T_{ag} \right) \wedge C_{RPost}\left( T_{ag} \right) \wedge \nonumber \\ N_{RHSC}\left( T_{ag} \right) \wedge N_{ROSC}\left( T_{obj}\right) \wedge C_{ROMS}\left( T_{obj}\right) \end{aligned}$$
(25)

The task is learned in the form of desired effects from ‘any’ target-agent’s perspective for ‘any’ target-object.

The above expression, when mapped onto the definitions of the inferred facts discussed earlier in this paper, results in the following representation:

$$\begin{aligned} effect \left( WI,WF \right) \leftarrow \nonumber \\ Nature \_ Effect \_ Class \_to\_ Reach \left( T_{ag},T_{obj} \right) \wedge \nonumber \\ Nature \_ Effect \_ Class \_to\_ Grasp \left( T_{ag},T_{obj} \right) \wedge \nonumber \\ Nature \_ Effect \_ Class \_to\_ See \left( T_{ag},T_{obj} \right) \wedge \nonumber \\ Nature \_ Visibility \_ Score \left( T_{obj},T_{ag} \right) \wedge \nonumber \\ Effect \_ Relative \_ Posture \left( T_{ag} \right) \wedge \nonumber \\ Nature \_ Effect \_ Hand \_ Status \left( T_{ag} \right) \wedge \nonumber \\ Nature \_ Effect \_ Object \_ Status \left( T_{obj}\right) \wedge \nonumber \\ Effect \_Object\_Motion\_Status\left( T_{obj}\right) \end{aligned}$$
(26)

The definitions used in the domain theory have already been presented in the expressions of Sect. 5.

The above domain theory, when unfolded, results in the general initial hypothesis space shown in Fig. 11.

Fig. 11
figure 11

Initial generalized hypothesis space for human-level effect-based understanding of tasks’ semantics. The symbolic mapping of the notations from the top–down order from left to right: \(N_{REC}\) : Nature of Relative Effort Class *, \(C_{RE}\) : Class of Relative Effort *, \(T_E\) : Type of Effort *, \(S\) : State *, \(A\) : Action*, (* : common for the first, third and sixth subtrees for different abilities of reach, grasp and see) \(C_{RPost}\) \(=\) Class of Relative Posture, \(Post\) : Posture, \(C_{ROms}\) \(=\) Class of Relative Object motion state, \(O_{ms}\) \(=\) Object motion state, \(N_{RViSC}\)  \(=\) Nature of Relative Visibility Score Class, \(C_{RViS}\) \(=\) Class of Relative Visibility Score, \(ViS\) \(=\) Visibility Score, \(N_{RHsC}\) \(=\) Nature of Relative Hand status Class, \(C_{RHs}\) \(=\) Class of Relative Hand status, \(H_s\) \(=\) Hand status, \(N_{ROsC}\) \(=\) Nature of Relative Object status Class, \(C_{ROs}\) \(=\) Class of Relative Object status, \(O_s\) \(=\) Object status

The training examples are provided in the form of the lowest-level representation, i.e. the 3D world model consisting of the positions and configurations of the objects and the agents. As the robot continuously observes the environment and infers facts, based on the time stamps of the start and end of a demonstration it autonomously instantiates the hierarchies of facts within the domain theory.

Further, to remain general enough to learn different tasks, we do not strictly provide the form of the learned concept as an operationality criterion. It could be composed of any set of nodes of the initial hypothesis space shown in Fig. 11.

6.3 m-Estimate Based Refinement

Each node of the initial hypothesis space of Fig. 11 serves as a predicate. To refine the learned concept based on multiple demonstrations, instead of directly pruning an explanation subtree upon observing two different values for a node, we use m-estimate based reasoning. The m-estimate has been shown to be useful for rule evaluation [26] and for avoiding premature conclusions [1]. This is because the generalized definition of the m-estimate incorporates the notion of experience, as described below.

Let us say a value v for a particular predicate p of a particular task tk has been observed in n demonstrations out of a total of N demonstrations. The likelihood (i.e. the measure of the extent to which a sample provides support for particular values) of observing the same value v in the next demonstration within the m-estimate framework is given as:

$$\begin{aligned} Q_{p}^{v,tk}\left( n,N \right) = \frac{n+a}{N+a+b} \end{aligned}$$
(27)

where, \(a> 0, b> 0, a+b=m \ and\ a=m \times p_{v}\)

m is domain dependent, and could also be used to account for noise [12]. From Eq. 27, the following properties can be deduced:

$$\begin{aligned}&Q_{p}^{v,tk}\left( 0,0 \right) =P_{v}> 0\end{aligned}$$
(28)
$$\begin{aligned}&Q_{p}^{v,tk}\left( 0,N \right) =\frac{a}{N+a+b} > 0\end{aligned}$$
(29)
$$\begin{aligned}&Q_{p}^{v,tk}\left( N,N \right) =\frac{N+a}{N+a+b} < 1 \end{aligned}$$
(30)

Hence, it does not assume a closed world, in the sense that if a value v has not been observed for predicate p, this does not mean that the likelihood of the existence of v is zero [expressions (28), (29)]. In fact, \(p_v\) is the prior probability of v, as can also be inferred from Eq. 28. On the other hand, if the same value v has always been observed, this is not accepted as a universal rule that p will always have the value v for the task tk [expression (30)]. Hence, it takes into account the likelihood of unseen demonstrations. These properties allow lifelong refinement of the learned concept.

$$\begin{aligned} Q_{p}^{v,tk}\left( N+1,N+1 \right) > Q_{p}^{v,tk}\left( N,N \right) \end{aligned}$$
(31)

The above property [expression (31)] ensures that, even when the value v has been observed in all the examples, the likelihood of observing the same value is higher if more examples have been demonstrated, thus incorporating the notion of experience.

$$\begin{aligned} Q_{p}^{v,tk}\left( 0,N\right) > Q_{p}^{v,tk}\left( 0,N+1 \right) \end{aligned}$$
(32)

This property ensures that, even when the value v has never been observed, the likelihood that v will not be observed in the future is lower if fewer examples have been demonstrated, thus again incorporating the notion of experience.

One acceptable instantiation of the m-estimate uses Laplace’s law of succession. This states that if in a sample of N trials there were n successes, the likelihood of the next trial being successful is (n + 1)/(N + 2), assuming that the initial distributions of success and failure are uniform. With a similar initial assumption, we also use a = 1 and a + b = 2 for the m-estimate of Eq. 27.
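A one-function sketch of Eq. (27) with this Laplace instantiation (a = b = 1, m = 2), illustrating the properties of expressions (28)–(32):

```python
def m_estimate(n: int, N: int, a: float = 1.0, b: float = 1.0) -> float:
    """Q_p^{v,tk}(n, N) of Eq. (27); with a = b = 1 this is Laplace's law of
    succession, (n + 1) / (N + 2)."""
    return (n + a) / (N + a + b)

# Properties (28)-(32): never 0, never 1, and experience matters.
print(m_estimate(0, 0))   # 0.5  (prior p_v with a = b = 1)
print(m_estimate(3, 3))   # 0.8
print(m_estimate(4, 4))   # ~0.833 > 0.8, expression (31)
print(m_estimate(0, 3), m_estimate(0, 4))  # 0.2 > ~0.167, expression (32)
```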

6.4 Consistency Factor

As the robot is required to reason autonomously about whether a predicate p is relevant or not, it analyzes the consistency of the observed values of the predicate. If the values are not always the same, the predicate might not be relevant for that task, and the values are just side effects, not the desired effect. We further assume that \(v_{h}\) is the value of p having the highest m-estimate obtained from Eq. 27. If this value is consistent over demonstrations, then the predicate p is relevant and its desired value will be \(v_{h}\). Let, for a particular predicate p, \(N_{p}\) different values \(\left\{ v_1,v_2,v_3,\ldots, v_{N_p} \right\} \) have been observed over N demonstrations. We define a consistency factor (CF) of p for task tk, used to decide about the relevance of p, as:

$$\begin{aligned} CF_{p}^{tk} =\overbrace{Q_{p}^{v_{h},tk}}^\mathrm{relevance evidence} -\underbrace{\sum _{i=1\wedge i\ne h}^{Np}Q_{p}^{v_{i},tk}}_\mathrm{non-relevance evidence} \end{aligned}$$
(33)
Fig. 12
figure 12

Deciding relevance and irrelevance of a predicate, as well as potential confusion

The first term on the right-hand side of the equation represents the evidence of p being relevant for the task: the higher this value, the greater the likelihood that the most observed single value \(v_{h}\) of p is part of the desired effect for task tk. The second term gives the likelihood of obtaining any of the observed values other than \(v_{h}\). This in fact represents the non-relevance evidence of p, \(NRE_{p}\): the higher this value, the lower the likelihood of p having a consistent value. Hence, based on the value of the consistency factor after any demonstration, we define the following three situations for a particular predicate p of a particular task tk (see Fig. 12):

  1. (i)

    Contradiction, so irrelevant p: A predicate p is assumed to be non-relevant, based on contradiction in its value, (a) if \(CF < 0\): the non-relevance evidence is collectively higher than the relevance evidence, or (b) if \(0 \le CF \le d_{1}\): the non-relevance evidence is significant enough to contradict the likelihood of \(v_{h}\) being the expected consistent value of p.

  2. (ii)

    Consistency, so relevant p: if \(CF>d_{2}\): the non-relevance evidence is significantly lower and can be ignored.

  3. (iii)

    Confusion, clarification required: if \(d_{1}<CF<d_{2}\): the non-relevance evidence is not sufficient to contradict the current understanding, but also not low enough to be ignored outright. In this case, the robot has to ask the human partner for clarification about the significance of the predicate p and its desired value, by framing a sentence including the values causing the confusion.

Maintaining separate boundaries allows \(d_{1}\) and \(d_{2}\) to be tuned based on various practical factors, such as the reliability of the demonstrations, the accuracy of the inferred facts, noise at different levels of the system, the nature, sensitivity and criticality of the domain and attributes, preferences on inconsistency tolerance, etc.

However, as the demonstrations are assumed to be positive, meaning we will not try to teach a child with wrong examples, even little evidence of a predicate assuming different values should be sufficient to prune that node from the explanation tree. This assumption results in \(d_{1}\), \(d_{2}\) and \(Q_{p}^{v_{h}}\) almost coinciding. Therefore, we set \(d_{2}\) based on a 10% tolerance of inconsistency for a relevant predicate, i.e. \(l_{2}= 0.1 \times Q_{p}^{v_{h}}\), hence:

$$\begin{aligned} d_{2}= Q_{p}^{v_{h}}-0.1\times Q_{p}^{v_{h}} \end{aligned}$$
(34)

Similarly, we give the robot greater autonomy to declare a predicate irrelevant if there exists non-relevance evidence as low as 30% of the relevance evidence, i.e. \(l_{1}= 0.3 \times Q_{p}^{v_{h}}\), which results in:

$$\begin{aligned} d_{1}= Q_{p}^{v_{h}}-0.3\times Q_{p}^{v_{h}} \end{aligned}$$
(35)
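Putting Eqs. (27) and (33)–(35) together, a sketch of the relevance decision for one predicate; the per-value demonstration counts are assumed to be maintained elsewhere, and the 10%/30% tolerances are the ones chosen above:

```python
from typing import Dict, Tuple

def q(n: int, N: int, a: float = 1.0, b: float = 1.0) -> float:
    return (n + a) / (N + a + b)          # m-estimate, Eq. (27)

def consistency_decision(value_counts: Dict[str, int], N: int) -> Tuple[str, str, float]:
    """Return (decision, v_h, CF) for one predicate after N demonstrations.
    value_counts maps each observed value to the number of demonstrations it appeared in."""
    estimates = {v: q(n, N) for v, n in value_counts.items()}
    v_h = max(estimates, key=estimates.get)
    relevance = estimates[v_h]                                        # first term of Eq. (33)
    non_relevance = sum(e for v, e in estimates.items() if v != v_h)  # second term of Eq. (33)
    cf = relevance - non_relevance                                    # Eq. (33)
    d2 = relevance - 0.1 * relevance                                  # Eq. (34)
    d1 = relevance - 0.3 * relevance                                  # Eq. (35)
    if cf > d2:
        return "relevant", v_h, cf
    if cf < 0 or cf <= d1:
        return "irrelevant", v_h, cf
    return "confusion", v_h, cf

# Three demonstrations, always 'Supportive':
print(consistency_decision({"Supportive": 3}, N=3))   # ('relevant', 'Supportive', 0.8)
# Three demonstrations, 2 x 'Supportive' and 1 x 'Not_Supportive':
print(consistency_decision({"Supportive": 2, "Not_Supportive": 1}, N=3))
# irrelevant: CF = 0.2 is below d1 = 0.42
```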

Setting the above-mentioned very narrow confusion band serves two main purposes: (1) the demonstrations are assumed to be positive and the observation of the world is assumed to be almost noise free at the symbolic level; (2) our robot was not equipped with a sufficient speech synthesis and dialogue module, and no feedback interface was installed in front of the demonstrators. Therefore, we wanted to avoid any explicit communication and instead let the demonstrator focus on the task and the robot focus on observing the demonstration and inferring facts.

At this point, it is important to note that the robot keeps track of the m-estimates of all the observed values for all the predicates, to maintain the notion of inter-value experience, even if a particular value has currently been found irrelevant. This allows the robot to incorporate experience and to modify its understanding over its lifetime.

7 Demonstrations and Analysis

7.1 Experimental Platform

The robot uses Move3D [59], an integrated 3D planning and visualization platform. Through its various sensors, the robot maintains and updates the 3D world state in real time. For object identification and localization, it uses a tag-based stereovision system. For localizing and tracking humans, it uses data from 3D motion sensors mounted on the robot. The human’s gaze is simplified to his/her head orientation, estimated in real time through markers tracked by a marker-based motion capture system.

We have tested our system on two different robots: JIDO, a home-built mobile manipulator equipped with a KUKA LWR arm, and the PR2 robot from Willow Garage. As shown in Fig. 13a, b, the JIDO and PR2 robots observe the environment. Figure 13c, d shows the 3D representation of the environment built and updated online by the robots.

Fig. 13

Mobile robots a JIDO and b PR2 are observing Human–Human interaction scenario. c, d 3D representation of the world built and updated online by the robots

7.2 Procedure

For each demonstration, the human operator provides the name of the task, indicates the performing-agent, the target-agent and the target-object, and tags the time stamps of start and finish of the task.

Note that the target-agent is the agent the task is targeted at, not the target of the experiment. The robot should be able to perceive the necessary and sufficient parts of the environment before and after the demonstration; therefore, in most cases the robot was placed closer to the performing-agent. This perception, coupled with our perspective-taking system, which allows the robot to reason from the perspective of the target-agent, provides sufficient information to analyze the effect of the demonstration.

The demonstrations were part of a bigger set of three complementary studies: semantics understanding, which is the focus of this paper; trajectory-level analysis of how different people perform different manipulation tasks; and the preferred places to perform the tasks with respect to both agents in different setups. Although, for our purpose, demonstrations by a single expert teacher would have been sufficient, because of the other studies the demonstrations were performed by different demonstrators. Since we did not tell the demonstrators where to perform the task, this adds some inherent diversity, which further demonstrates the working of the presented framework.

The demonstrators were researchers from our robotics lab, selected based on their availability. They were aware of the robot's perception system; hence, they knew how to hold the object (i.e. not to hide the tag) so that the robot could track it most of the time. They were told the name of the task they were expected to perform and given hints about some of the confusing aspects. For example, most of them were told the main difference between make accessible and give: that the object should be placed somewhere in the former case. However, most of them did not know the algorithm or what was being learned. There were three to five demonstrations for each task, and the data used in this paper are taken from at least two different demonstrators per task.

In the current approach, the name of the target-object is explicitly provided to the robot. However, works on autonomous learning of task-relevant objects such as [35] could be adapted for this purpose.

In all the demonstrations, the explanation tree has been constructed by inferring the facts from the target-agent’s perspective. This is to find the desired effect for the person for whom the task has to be performed. However, a similar tree could be constructed from the perspective of the performing-agent as well, to find how the agent prefers to perform the task.

As explained in Sect. 6, the robot constructs an explanation tree for each new demonstration of the task by instantiating the hypothesis tree of Fig. 11. For instantiating the leaf nodes, predicates with superscript 1 represent data from WI, the initial world state, whereas those with superscript 2 represent data from WF, the final world state.

The greater the diversity among demonstrations of the same task, the faster the non-relevant predicates are pruned out of the task's understanding. To achieve this diversity, we changed the scenarios across demonstrations by changing the relative arrangements of the performing- and target-agents, the target-object and its initial position, and the positions of other objects and furniture.

Next we will discuss the resulting understanding of the robot for different tasks.

7.3 Show an Object

The first task demonstrated to the robot was to show an object. Figure 14a–d shows the final scenarios of four different demonstrations of the task. The red quadrilaterals show the initial positions of the target-object [which is the cup in (a) and the wooden cube in (b), (c) and (d)], and the red arrows mark the final position of the target-object at the end of the task. In situations (a) and (c) the target-agent was the person on the left, whereas in (b) he was the performing-agent. In (d) the person on the right was the target-agent. The largest consistent sub-tree after the first two demonstrations, (a) and (b), is shown in Fig. 15. Below each node of the tree, the corresponding inferred values of the predicates are shown in braces {}. The learned target concept for the task is obtained as a Horn clause over the leaf nodes of this sub-tree.

Fig. 14

Human–Human demonstration for show an object task. Initial positions of the target-object are shown by red quadrilaterals. a Right human is showing the cup by holding it. b Left human is showing the wooden cube by holding it. c Right human is showing the wooden cube by holding it. d Left human is showing the wooden cube by making it visible by putting it on the top of the white box. (Color figure online)

Fig. 15

Explanation tree for the show task after two demonstrations a and b of Fig. 14

Figure 16 shows a partial instantiation of the hypothesis space for the individual demonstration of Fig. 14c, and Fig. 17 shows the refined explanation, based on the largest common, m-estimate consistent sub-tree over all three demonstrations. In the fourth demonstration of the same task, a different pair of performing- and target-agents demonstrated the task in standing postures. The performing-agent put the target-object, the wooden cube, on another object, the white box, to make it visible, as shown in Fig. 14d. Figure 18 shows the refined explanation tree. The refined understanding after these four demonstrations, formed by the Horn clause over the leaf nodes, is:

$$\begin{aligned} Task\left( Show\_Object \right) \leftarrow (C_{RPost}=Maintained)\wedge \nonumber \\ (O_{ms}^{1}=Static)\wedge (O_{ms}^{2}=Static)\wedge \nonumber \\ (C_{RViS}=Increased)\wedge \nonumber \\ (A_{v}^{2}(see)= No\_Action\_Required)\wedge \nonumber \\ (H_{s}^{1}=Object\_Free)\wedge (H_{s}^{2}=Object\_Free)\nonumber \\ \end{aligned}$$
(36)

By replacing the abbreviations with the symbolic terms presented in Sect. 5, the above expression results in the following:

$$\begin{aligned} Task\left( Show\_Object \right) \leftarrow \nonumber \\ (Relative\_Posture=Maintained)\wedge \nonumber \\ (Object\_Initial\_Motion\_Status=Static)\wedge \nonumber \\ (Object\_Final\_Motion\_Status=Static)\wedge \nonumber \\ (Object\_Relative\_Visibility\_Score=Increased)\wedge \nonumber \\ (Action\_to\_See=No\_Action\_Required)\wedge \nonumber \\ (Initial\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Final\_Hand\_Status=Object\_Free)\nonumber \\ \end{aligned}$$
(37)

Note that the above understanding is from the target-agent's perspective. This means the target-agent should put no effort into seeing the target-object, the visibility score of the target-object should be increased from the target-agent's perspective, the hand of the target-agent should be free of any object, etc. Also note that irrelevant predicates, such as the reachability and graspability of the target-agent as well as the object's status, have been autonomously pruned out of the learned desired effect of the task.
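
To illustrate how such a learned concept can be used, the following sketch encodes expression (37) as a set of predicate–value pairs and checks whether the effect inferred for a new situation (from the target-agent's perspective) satisfies it. The dictionary-based encoding and the satisfies helper are illustrative assumptions of ours, not the internal representation used by the system.

```python
# Learned desired effect of the show task, transcribed from expression (37).
SHOW_OBJECT = {
    "Relative_Posture": "Maintained",
    "Object_Initial_Motion_Status": "Static",
    "Object_Final_Motion_Status": "Static",
    "Object_Relative_Visibility_Score": "Increased",
    "Action_to_See": "No_Action_Required",
    "Initial_Hand_Status": "Object_Free",
    "Final_Hand_Status": "Object_Free",
}

def satisfies(observed_effect: dict, concept: dict) -> bool:
    """True if every predicate-value pair of the learned concept holds in the
    effect inferred from the target-agent's perspective. Predicates pruned as
    irrelevant (e.g. reachability, graspability) are simply absent from the
    concept and therefore unconstrained."""
    return all(observed_effect.get(p) == v for p, v in concept.items())
```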

Fig. 16

Partial instantiation of the hypothesis space of Fig. 11 for explaining the show object task of demonstration (c) of Fig. 14. Note the main difference in the left most sub-trees, compared to the explanation tree of Fig. 15

Fig. 17

Refined consistent explanation tree after the three demonstrations, a, b and c, of Fig. 14 for the show object task

Explicitly learning the preconditions for a task is not within the scope of this paper; however, when a leaf node corresponding to WI appears in the learned concept, we keep it, so that task planners can use it to enrich the list of preconditions associated with the desired effects.

Fig. 18

Refined consistent explanation tree after the fourth demonstration, (d), of Fig. 14 for the show object task. Note that the abstraction level for \(C_{RPost}\) has been changed, compared to the understanding of Fig. 17

Remark on m-estimate and consistency factor: Table 2 analyzes the consistency of node \(O_{S}^{2}\) of the explanation tree, which corresponds to the final object status in the show object task demonstrations. The fourth column shows the m-estimate of the value v of the predicate \(O_{S}^{2}\) having the highest m-estimate, and the fifth column shows the simple probability-based estimate of value v. Comparing both columns reveals an interesting difference. The m-estimate, as explained earlier, takes experience into account. Hence, with each new demonstration in which value v = In_Hand is obtained (the first three demonstrations), the robot's belief that it will observe the same value increases, based on experience. In the case of simple probability, the estimate always stays at 1; hence, simple probability fails to update the experience-based belief. This observation illustrates the property of the m-estimate expression (31): even when the same value has been observed in all demonstrations so far, the expectation of observing that value in the next demonstration keeps increasing with the number of demonstrations.

Table 2 Consistency factor analysis

Further, our approach does not fix the values of the thresholds across demonstrations, as can be observed from columns seven and eight: the thresholds \(d_{1}\) and \(d_{2}\) of Fig. 12 vary dynamically. This is because our approach derives these thresholds from the m-estimate of the relevance evidence, Eqs. (34) and (35). Note that in the last column, for the first three demonstrations, the consistency factor is well above \(d_{2}\), so the robot considers the predicate and its value as relevant. Now consider the row corresponding to the fourth demonstration. This time a different value, On_Support, has been observed for the predicate, but the value with the highest m-estimate is still In_Hand. After the fourth demonstration, the consistency factor of the highest m-estimate value of the predicate is 0.33334. This is well below \(d_{1}\), the minimum threshold for confusion, and makes the robot sure about the predicate's irrelevance.

Note that we did not observe any confusion, because of the deliberately narrow confusion zone. As already mentioned, we assume the demonstrations to be positive, so a little inconsistency makes the predicate irrelevant. However, if we decide to give the robot less decisional autonomy, forcing it to ask for clarification instead of directly deciding irrelevance, we could make the confusion zone wider. For example, if we set the range to be between 10 and 60 % of the relevance evidence instead of 10–30 %, then the same demonstration 4 results in a confusion, as \(d_{1}\) becomes 0.26667 and the consistency factor 0.33334 then lies between \(d_{1}\) and \(d_{2}\).
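
Reusing the classify sketch given at the end of Sect. 6, this widening effect for demonstration 4 of Table 2 can be reproduced numerically; the value \(Q_{p}^{v_{h}}\approx 0.667\) used below is implied by the reported \(d_{1}=0.26667\) of the 10–60 % setting.

```python
# Demonstration 4 of the show task (Table 2): CF of In_Hand drops to 0.33334.
cf, q_vh = 0.33334, 0.66668

# Narrow band (10-30 %): the predicate is pruned as irrelevant.
print(classify(cf, q_vh))                      # Relevance.IRRELEVANT
# Wider band (10-60 %): the same evidence triggers a clarification request.
print(classify(cf, q_vh, tol_irrelevant=0.6))  # Relevance.CONFUSION
```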

7.4 Hide an Object

The next task demonstrated to the robot was to hide an object. Figure 19 shows three demonstrations of this task with different initial scenarios. The understanding of the task after these three demonstrations, formed by the Horn clause over the leaf nodes, results in the following symbolic representation:

$$\begin{aligned} Task\left( Hide\_Object \right) \leftarrow \nonumber \\ (Human\_Initial\_Posture=Sitting)\wedge \nonumber \\ (Human\_Final\_Posture=Sitting) \wedge \nonumber \\ (Object\_Initial\_Motion\_Status=Static)\wedge \nonumber \\ (Object\_Final\_Motion\_Status=Static)\wedge \nonumber \\ (Object\_Final\_Visibility\_Score\approx 0)\wedge \nonumber \\ (Relative\_ Effort \_Class\_to\_See=Becomes\_ Difficult )\wedge \nonumber \\ (Initial\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Final\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Object\_Initial\_Status=On\_Support)\wedge \nonumber \\ (Object\_Final\_Status=On\_Support)\nonumber \\ \end{aligned}$$
(38)

Similar to the show task, further demonstrations in which the target-agent is standing would result in a refined understanding of the target-agent's posture attribute. However, note that the main differences between the understandings of the show and hide tasks have been captured. In the hide task, the effort hierarchy corresponding to the ability to see the object is pruned at the relative effort class level, instead of retaining the lowest-level node of required action, as was the case for the show task. This results in the understanding that, for the hide task, the relative effort to see the object should become difficult for the target-agent. Also, the visibility-score-based subtree has been pruned at the lowest level, the absolute value of the visibility score. For the show task, the actual values of the visibility score from the target-agent's perspective were not consistent, but were always greater than the initial values. Hence, for the show task the framework autonomously settles on a higher level of abstraction: increased relative visibility score. In the case of the hide task, the absolute value of the visibility score itself is always negligible, making the visibility score node consistent and relevant.

Fig. 19

Three demonstrations for the task of hiding an object, observed by JIDO robot. First column shows the initial scenarios and the second column shows the final scenarios after performing the task

Note that, again, the effects on the target-agent's abilities to reach and grasp the object have been autonomously found to be irrelevant and pruned out of the explanation tree, as was the case for the show task.

7.5 Make an Object Accessible

Next, the task of making an object accessible was demonstrated to the robot. There were a total of five demonstrations; three of them were similar to the demonstrations of Fig. 2b–d. The remaining two demonstrations were in different relative arrangements of the objects and the humans, and the target-agent was in a standing posture.

The intention behind the make-accessible task is to provide the target-agent with the flexibility to take the object sometime in the future, whenever required. Therefore, the provided end time stamps of the make-accessible demonstrations were the instants when the performing-agent finished the task by putting the object on the table to make it accessible, not the instants shown in Fig. 2b–d, where the target-agent has already begun to reach for the target-object.

The robot's understanding of the make-accessible task after these five demonstrations results in the following symbolic representation:

$$\begin{aligned} Task\left( Make\_Accessible \right) \leftarrow \nonumber \\ (Relative\_ Effort \_to\_Reach=Becomes\_Easier)\wedge \nonumber \\ (Relative\_Posture=Maintained)\wedge \nonumber \\ (Object\_Initial\_Motion\_Status=Static)\wedge \nonumber \\ (Object\_Final\_Motion\_Status=Static)\wedge \nonumber \\ (Relative\_ Effort \_to\_Grasp=Becomes\_Easier)\wedge \nonumber \\ (Object\_Relative\_Visibility\_Score=Increased) \wedge \nonumber \\ (Nature\_ Effort \_Class\_to\_See=Supportive) \wedge \nonumber \\ (Initial\_Hand\_Status=Object\_Free) \wedge \nonumber \\ (Final\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Object\_Initial\_Status=On\_Support)\wedge \nonumber \\ (Object\_Final\_Status=On\_Support)\nonumber \\ \end{aligned}$$
(39)

Note that the predicate related to the agent's posture captures the desired effect of the task on the posture. It does not refer to changes of posture that might occur due to the actions required by the target-agent to see, reach or grasp the target-object; those are captured by another set of attributes encoded in the effort-based hierarchy, for example \(Relative\_ Effort \_to\_Reach\). The Relative_Posture predicate in the above understanding indicates that the task does not change the posture of the agent.

The interesting observation for the make-accessible task is that the framework did not filter out reachability and graspability as irrelevant predicates, as was the case for the show and hide tasks. It understands that reaching and grasping the target-object should become easier for the target-agent.

7.6 Give an Object

The next task demonstrated to the robot was to give an object. The scenarios were similar to those of the make-accessible task, the only difference being that the performing-agent held the target-object at an appropriate place in space and waited for the target-agent to take it, instead of putting the object on a support. For this task, the end time stamps indicated to the robot were the moments when the target-agent took the object from the performing-agent. There were a total of three demonstrations, and the task understanding based on the leaf nodes of the m-estimate consistent explanation sub-tree results in the following symbolic representation:

$$\begin{aligned} Task\left( Give \right) \leftarrow \nonumber \\ (Action\_to\_Reach=No\_Action\_Required)\wedge \nonumber \\ (Relative\_Posture=Maintained)\wedge \nonumber \\ (Action\_to\_Grasp=No\_Action\_Required)\wedge \nonumber \\ (Object\_Initial\_Motion\_Status=Static)\wedge \nonumber \\ (Object\_Final\_Motion\_Status=Static) \wedge \nonumber \\ (Object\_Relative\_Visibility\_Score=Increased) \wedge \nonumber \\ (Nature\_ Effort \_Class\_to\_See=Supportive) \wedge \nonumber \\ (Relative\_Hand\_Status=Manipulability\_Gained) \wedge \nonumber \\ (Object\_Initial\_Status=On\_Support)\wedge \nonumber \\ (Object\_Final\_Status=In\_Hand)\nonumber \\ \end{aligned}$$
(40)

Compared to the make-accessible task, the main differences (which are in fact interrelated) in the understanding of the give task are: the target-agent should apply no action to reach and grasp the target-object, the target-object should be in the hand of the target-agent, and the manipulability of the target-object should be gained by the target-agent. Further, it encodes that the give task is not finished until the target-object is in the target-agent's hand, whereas for the make-accessible task it is sufficient to make the target-object easier for the target-agent to reach and grasp.

7.7 Put-Away an Object

Next, we demonstrated the task of putting away an object. There were four demonstrations in different situations. The following is the robot's understanding of the put-away task:

$$\begin{aligned} Task(Put\_Away)\leftarrow \nonumber \\ (Relative\_ Effort \_to\_Reach=Becomes\_ Difficult )\wedge \nonumber \\ (Relative\_Posture=Maintained)\wedge \nonumber \\ (Object\_Initial\_Motion\_Status=Static)\wedge \nonumber \\ (Object\_Final\_Motion\_Status=Static)\wedge \nonumber \\ (Relative\_ Effort \_to\_Grasp=Becomes\_ Difficult )\wedge \nonumber \\ (Relative\_Visibility\_Score=Decreased)\wedge \nonumber \\ (Relative\_ Effort \_to\_See=Maintained)\wedge \nonumber \\ (Initial\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Final\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Object\_Initial\_Status=On\_Support)\wedge \nonumber \\ (Object\_Final\_Status=On\_Support)\nonumber \\ \end{aligned}$$
(41)

7.8 Hide-Away an Object

The next task to demonstrate was to hide-away an object. Again, there were four demonstrations in different situations, resulting in the following understanding:

$$\begin{aligned} Task\left( Hide\_Away \right) \leftarrow \nonumber \\ (Relative\_ Effort \_to\_Reach=Becomes\_ Difficult )\wedge \nonumber \\ (Relative\_posture=Maintained)\wedge \nonumber \\ (Object\_Initial\_Motion\_Status=Static)\wedge \nonumber \\ (Object\_Final\_Motion\_Status=Static)\wedge \nonumber \\ (Relative\_ Effort \_to\_Grasp=Becomes\_ Difficult )\wedge \nonumber \\ (Object\_Final\_Visibility\_Score\approx 0.0) \wedge \nonumber \\ (Relative\_ Effort \_to\_See=Becomes\_ Difficult )\wedge \nonumber \\ (Initial\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Final\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Object\_Initial\_Status=On\_Support)\wedge \nonumber \\ (Object\_Final\_Status=On\_Support)\nonumber \\ \end{aligned}$$
(42)

Note that the above understanding of the hide-away task inherits properties of both the hide and put-away tasks. For the hide task the abilities to reach and grasp were found to be irrelevant, whereas in the hide-away task, as in the put-away task, these have been found relevant in the sense that reaching and grasping the target-object should become difficult from the target-agent's perspective. For the put-away task, the robot found the relative visibility score to be decreasing, because of the farther-away position of the target-object, whereas for the hide-away task the absolute visibility score itself has been found to be negligible from the target-agent's perspective, as was the case for the hide task. Hence, the framework prunes the subtrees at different, appropriate levels of abstraction.

In these resulting task understandings, the effects related to the abilities to reach and to grasp, wherever they appeared in the learned concepts, were similar. This is because of the types of task demonstrated. However, if the robot were demonstrated tasks such as placing the target-object so that the target-agent can just touch it, the framework would capture independent and different effects for the abilities to reach and to grasp.

8 Performance Analysis

In this section, we analyze the performance in terms of the average processing time and with respect to an intuitive understanding of the tasks. It is important to note that, as mentioned earlier, the numbers of demonstrations reported below correspond to demonstrations by expert teachers, with diversity among demonstrations of the same task achieved manually for faster convergence. Following our analogy that we are not teaching a 'child' with wrong demonstrations, analysis of experiments with non-expert teachers is beyond the scope of the paper.

8.1 Processing Time

Table 3 shows the number of demonstrations per task and the average processing time per demonstration (excluding the time of the demonstration itself). It is interesting to observe that more time is taken for those tasks which require the robot to apply a larger number of virtual actions on the target-agent to find the least feasible effort for an ability. For example, in the case of the hide task, the performing-agent placed the target-object so as to be invisible from the target-agent's perspective. In most such cases, the robot needs to apply Whole_Body_Effort or even Displacement_Effort to find the least feasible effort for the target-agent to see or reach the target-object, after sequentially testing the lower effort levels. In contrast, for tasks in which the least feasible effort of the target-agent is found by applying virtual actions corresponding to lower-level efforts such as Head_Effort or Arm_Effort, the computation time is smaller. For example, for the give task, from the target-agent's perspective, the least efforts to see and to reach are both low, hence the lower processing time.

Table 3 Number of demonstrations per task and average learning/refinement time (in s) after each demonstration for each task

In fact, the processing after each demonstration depends upon the total number of predicates in the domain theory: even when the learned concept appears to be pruned significantly, the robot maintains the m-estimates of all the predicates to allow for lifelong learning and confusion-based refinement. This is a choice we have made. Alternatively, one could choose to refine only the tree learned so far and batch-process the remaining data offline. This would make the system learn faster, but requires deciding what to batch-process and when.
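
The sequential least-feasible-effort search discussed in Sect. 8.1 can be sketched as follows; the effort ordering follows the levels named above, while the ability_check callback, which virtually applies the corresponding action and re-tests the ability, is a hypothetical placeholder rather than an actual interface of our system.

```python
from typing import Callable, Optional

# Effort levels ordered from cheapest to most expensive, as named in the text.
EFFORT_LEVELS = ["No_Effort", "Head_Effort", "Arm_Effort",
                 "Whole_Body_Effort", "Displacement_Effort"]

def least_feasible_effort(ability_check: Callable[[str], bool]) -> Optional[str]:
    """Return the cheapest effort level at which the target-agent could
    see/reach/grasp the target-object, or None if even Displacement_Effort
    does not suffice. The number of virtual actions tried before success
    dominates the per-demonstration processing time of Table 3."""
    for effort in EFFORT_LEVELS:
        if ability_check(effort):   # apply the virtual action and test the ability
            return effort
    return None
```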

8.2 Analyzing Intuitive and Learned Understanding

In this paper, we have tried to incorporate a subset of predicates for a better human-level understanding of a task, but we cannot claim it to be a complete domain theory for understanding such tasks. We do not follow a closed-world assumption, and in fact we should not, as a task could have effects that cannot be captured in the current hypothesis space, and could involve other physical, mental and emotional states of the target-agent, his/her desires, and so on. For example, does making an object accessible require the target-agent to maintain his/her posture or keep his/her feet on the ground? Should the target-object always become more easily visible, or is it acceptable to compromise visibility for ease of reachability? Therefore, it is practically not possible to figure out an exact ground-truth model, and hence to define a domain theory that is "complete" or "accurate".

The question of the "correct tree" will probably remain unanswered until we have "complete" domain knowledge and universally accepted "precise semantics" of tasks. Hence, it is difficult to design an objective function for analysis and comparison. Therefore, in this paper we can only reason intuitively about a "partially correct" explanation of a task for comparison purposes.

Since the focus of the paper is on understanding tasks performed for other agents, in this section we have chosen the predicates related to the effect on the target-agent's abilities and on the target-object's state to analyze this "partial correctness". For this, we have compared the robot's understanding against an intuitive understanding of the tasks. The intention is to demonstrate the strength of the presented framework: it can reach such human-level understanding even from two positive demonstrations, if they are demonstrated as differently as feasible. Table 4 summarizes this comparison. Note that reaching an understanding at a higher level of abstraction, such as supportive, takes more trials than an understanding based on lower-level predicates, such as directly, which maps to \( No\_Effort \), a value at a lower level of the effort hierarchy. This is expected, as the pruning of the sub-trees is bottom-up, for the sake of understanding a task at the appropriate level of abstraction and generalization. Therefore, the number of demonstrations needed to conclude non-relevance generally accumulates as the level of abstraction goes up.

Table 4 Analyzing the key attributes’ understanding by the robot with an intuitive understanding for a task

It is worth noting that, even though, as explained earlier, it is difficult to define an objective function and a correct tree for comparison, the intuitive understanding used in this section and the analysis of the results from different perspectives, as in the columns of Tables 3 and 4, could serve to analyze and compare the results of future work on understanding such tasks.

8.3 Necessity of Ability Based Facts and Perspective Taking

Let us assume that the hypothesis tree does not contain any ability-related attribute based on perspective taking of the target-agent. In that case, the understood semantics of the tasks Make Accessible and Put Away are:

$$\begin{aligned} Task\left( Make\_Accessible \right) \leftarrow \nonumber \\ (Relative\_Posture=Maintained)\wedge \nonumber \\ (Object\_Initial\_Motion\_Status=Static)\wedge \nonumber \\ (Object\_Final\_Motion\_Status=Static)\wedge \nonumber \\ (Initial\_Hand\_Status=Object\_Free) \wedge \nonumber \\ (Final\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Object\_Initial\_Status=On\_Support)\wedge \nonumber \\ (Object\_Final\_Status=On\_Support)\end{aligned}$$
(43)
$$\begin{aligned} \nonumber \\ Task(Put\_Away)\leftarrow \nonumber \\ (Relative\_Posture=Maintained)\wedge \nonumber \\ (Object\_Initial\_Motion\_Status=Static)\wedge \nonumber \\ (Object\_Final\_Motion\_Status=Static)\wedge \nonumber \\ (Initial\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Final\_Hand\_Status=Object\_Free)\wedge \nonumber \\ (Object\_Initial\_Status=On\_Support)\wedge \nonumber \\ (Object\_Final\_Status=On\_Support) \end{aligned}$$
(44)

Hence, these two tasks, which are opposite in nature, are not semantically separated, because the core difference lies in the target-agent's perspective on the ability to reach the object. Therefore, reasoning about abilities from the target-agent's perspective is a must for understanding those types of tasks in which one agent performs for another agent.

8.4 Importance of Levels of Abstraction

Let us take the example of the learned semantics of the hide and show tasks. In the show task, the sub-tree related to the ability to see the target-object has been preserved at the lowest level, the target-agent's required action (Action_to_See = No_Action_Required). This yields the intuitive notion that the target-agent should not put any effort into seeing the object. For the hide task, the same sub-tree has been pruned up to the relative effort class level (Relative_Effort_Class_to_See = Becomes_Difficult), which yields the understanding that the target-object should become difficult for the target-agent to see, while the actual effort to see it can vary depending upon the situation. Similarly, for the make-accessible task the same sub-tree has been pruned at the highest level, i.e. the nature of the effect (Nature_Effort_Class_to_See = Supportive). This can be defended as follows: initially the object might be hidden or already visible, and making it accessible might increase its visibility or at least maintain it, while reducing the effort needed by the target-agent to reach it. Hence, even though there is inconsistency in the values at the relative effort level of the hierarchy, the nature/intention has been captured at the higher level of abstraction.

In the above example, if we did not have leaves at the action level, the difference between the show and hide tasks would still be captured at the relative effort class to see (i.e. easy or difficult to see). However, the finer notion that the target-agent should put no effort into seeing the object in the show task would be lost. Similarly, if we did not have the qualifying level of abstraction, nature, the inconsistency at the level of relative effort to see would result in an understanding that considers the ability to see the object as an irrelevant predicate. Hence, designing the knowledgebase with meaningful levels of abstraction is important to capture the intuitive semantics of the tasks, to avoid missing relevant predicates as well as to avoid losing finer details.
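
The bottom-up pruning over one ability sub-tree can be sketched as follows; this is a simplified illustration in which the per-demonstration values are invented for the example, and the strict all-values-equal test stands in for the m-estimate based consistency test of Sect. 6.

```python
# Levels of the "ability to see" sub-tree, ordered from the most specific
# leaf to the most abstract node.
SEE_HIERARCHY = ["Action_to_See",
                 "Relative_Effort_Class_to_See",
                 "Nature_Effort_Class_to_See"]

def retained_level(hierarchy, values_per_level):
    """Return (level, value) of the deepest node whose observed values are
    consistent across demonstrations, or None if even the most abstract
    level is inconsistent (the predicate is then irrelevant)."""
    for level in hierarchy:                      # bottom-up
        observed = values_per_level[level]
        if len(set(observed)) == 1:              # consistent across demonstrations
            return level, observed[0]
    return None

# Show task: the leaf itself is consistent, so the finest notion is kept.
show = {"Action_to_See": ["No_Action_Required"] * 4,
        "Relative_Effort_Class_to_See": ["Becomes_Easier"] * 4,
        "Nature_Effort_Class_to_See": ["Supportive"] * 4}
# Hide task: the required action varies (illustrative values), so the
# sub-tree is pruned one level up to the relative effort class.
hide = {"Action_to_See": ["Turn_Head", "Stand_Up", "Walk_Around"],
        "Relative_Effort_Class_to_See": ["Becomes_Difficult"] * 3,
        "Nature_Effort_Class_to_See": ["Restrictive"] * 3}

print(retained_level(SEE_HIERARCHY, show))  # ('Action_to_See', 'No_Action_Required')
print(retained_level(SEE_HIERARCHY, hide))  # ('Relative_Effort_Class_to_See', 'Becomes_Difficult')
```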

8.5 Remark on Practical Limitations

Practical issues related to the inference of various facts limit the output of the presented framework. For example, the object_in_hand fact is inferred if the object is not on a support and the hand is close to the object. This prevents estimating how the object is being grasped by the agent, i.e. the relative position and orientation of the hand with respect to the grasped object. This further limits the understanding of a dual grasp, i.e. whether there is sufficient space for the object to be grasped simultaneously by another human. Therefore, for the give task we deliberately chose an object of larger dimensions. This always leaves sufficient space, so that the robot can positively infer a collision-free grasp by the target-agent. Another limitation arises from the localization of the object: since it is based on the tag on the object, the performing-agents were instructed to manipulate the object such that the tag always faces the robot's camera.

9 Potential Applications and Benefits

The symbolic understanding of a task, along with its geometric counterpart, makes the robot more 'aware' of the task. In this section we discuss some of the potential applications.

9.1 Generalization to Novel Scenario

The understanding of a task is independent of the shape and size of the object, the trajectory, as well as the absolute and relative distances among the agents and objects. This enables the robot to perform the task in an entirely different scenario. For example, the robot will be able to perform the task of making an object accessible by putting it on top of another object, even if this has never been demonstrated to it.

9.2 Greater Flexibility for Symbolic and Shared Cooperative Plan Generation

If the symbolic-level planner knows the semantics of a task independent of the actions to be executed, it can plan to achieve the task in a variety of ways. For example, if it 'understands' that hiding means the target-object should be difficult for the target-agent to see, then, depending upon the situation, it could plan to cover the target-object with some other container-type object to make it invisible. Similarly, for showing or making an object accessible, instead of directly manipulating the target-object, it could plan to displace the occluding or obstructing object from the human's perspective to achieve the same desired effect. The task planner could even involve a third agent to achieve the task.

Goal-effect based task planners, such as [2, 3, 8, 17], and affordance based task planners [5, 48], could take such high-level constraints into account and generate alternative solutions in terms of actions, agents and objects for a task. Our ongoing work in this direction, such as [17, 18, 48], demonstrates the feasibility of a hybrid planner capable of generating cooperative shared plans for achieving a task in the human–robot interaction domain, based on combined geometric and symbolic reasoning. It is able to take into account aspects such as which agent can perform which type of action, and the associated levels of effort, constraints and preferences. Nevertheless, it will be interesting future work to incorporate the complementary aspect of action learning from demonstrations, through a mechanism that reduces the complexity, or at least guides the search space, of such a task planner.

9.3 Transfer of Understanding Among Heterogeneous Agents

Since the robot understands the task independently of trajectory planning and control-level execution, it can easily transfer the task semantics to another robot of entirely different kinematic structure and shape. The other robot, equipped with similar reasoning capabilities, could then interpret the understanding and perform the task while respecting its own whole-body planning constraints.

9.4 Feasibility of Reproducing the Learned Task

As the learning is not at the trajectory or sub-action level, the task cannot be reproduced by imitation approaches. In fact, the domain of effect-based task planning is a complementary research topic.

As shown in Fig. 20, a task learned in terms of desired effect could be reproduced in different ways:

  1. (i)

    Effect-to-Parameter Converter: By using a converter that interprets the effect and converts it into the parameters of a geometric task planner.

  2. (ii)

    High-Level Task Planner: By using a high-level task planner, such as HATP [2] and others discussed in Sect. 9.2, to find the actions that could bring about the desired learned effect in the world. The advantage of using such a high-level symbolic planner is that it can plan to achieve the same effect in different ways, as discussed in Sect. 9.2. As shown by the dotted link, a two-way handshake between the low-level planner and high-level planners, such as [3, 17], could better converge to a feasible plan, based on the desired effect and effort.

In [47] we presented a framework to find candidate places to perform a task, once the semantics of the task are known in terms of effort-based abilities to see and reach. In [49], we presented a framework to plan for such basic HCOM tasks. In [51] we presented an integrated planner based on constraint hierarchies for planning a set of basic pick-and-place tasks, along with experimental results. The core of that planner takes as input three sets: candidate places, candidate grasps and candidate orientations of the object. Further, it takes the range of visibility scores and the need of a dual grasp as input parameters to test the feasibility of a solution.

As the domain of the tasks demonstrated in the current paper is within the scope of our planner presented in [51], we have implemented a simple converter to ground the learned effects in terms of the parameters of this planner. This converter has different sub-modules for converting different predicates. For example, to convert the symbolic value "easier" for the ability "reach" of the target-agent, it finds the target-agent's current least effort, \(\hbox {C}_{LE}\), to reach the target-object. Then it finds the places which are reachable by the target-agent with effort levels lower than \(\hbox {C}_{LE}\), using the approach in [47]. These places serve as the input set of candidate places. Similarly, the value of the desired visibility score is passed directly. If the object status should be on_support, this is also passed directly, as the framework is capable of finding a set of stable placements on any support plane. As the effect-based task understanding is at a level of abstraction that makes it independent of the kinematics of the agent, we have taken advantage of this to feed the converted constraints for planning the hide task on the PR2 robot, although the task was learned on the JIDO robot. Figure 21 shows the PR2 robot reproducing the task of hiding a target-object (grey tape) from a target-agent (human) in a different scenario.
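
The following sketch illustrates the shape of such a converter sub-module for the "reach becomes easier" effect; both helper functions, standing in for the effort analysis and candidate-place generation of [47], are hypothetical placeholders rather than actual interfaces of our system.

```python
def current_least_effort_to_reach(target_agent, target_object, world) -> str:
    # Placeholder for the effort-based reachability analysis of [47],
    # returning the target-agent's current least effort C_LE.
    raise NotImplementedError

def places_reachable_below(target_agent, effort_limit, world) -> list:
    # Placeholder for candidate-place generation: places the target-agent
    # can reach with strictly less effort than effort_limit.
    raise NotImplementedError

def convert_learned_effect(effect: dict, target_agent, target_object, world) -> dict:
    """Ground a learned desired effect into parameters of the geometric
    planner of [51]."""
    params = {}
    if effect.get("Relative_Effort_to_Reach") == "Becomes_Easier":
        c_le = current_least_effort_to_reach(target_agent, target_object, world)
        params["candidate_places"] = places_reachable_below(
            target_agent, effort_limit=c_le, world=world)
    if "Object_Final_Visibility_Score" in effect:
        # The desired visibility score is passed through unchanged.
        params["visibility_score"] = effect["Object_Final_Visibility_Score"]
    if effect.get("Object_Final_Status") == "On_Support":
        # The planner itself enumerates stable placements on support planes.
        params["require_stable_placement"] = True
    return params
```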

Fig. 20

Different ways to reproduce a learned task, based on its desired effect. The effect could either be converted into constraints for a geometric pick-and-place task planner, or used by a high-level symbolic planner to produce entirely different plans depending upon the situation

Fig. 21

Using the effect-to-parameter converter, the learned effects of the hide task have been converted into constraints compatible with our geometric planner for human-centered tasks [51]. Using those, the PR2 robot reproduces the task of hiding a target-object (grey tape) from a target-agent (human) in a different scenario

Note that the planner of [51] only serves as an example of what could be used in the geometric planner block of Fig. 20. For a different geometric planner, an appropriate effect-to-parameter converter would be required.

9.5 Understanding by Observing Heterogeneous Agents

Our system is capable of effort-based visuo-spatial perspective-taking analysis not only for humans but also for different robots: JIDO, HRP2 and PR2. Therefore, using the same framework and system, the robot could understand task semantics from demonstrations in which the target-agent is a robot. This facilitates understanding of situations in which the human might perform the task for the observer robot itself or for another type of robot.

9.6 Generalization for Multiple Target-Agents

Another interesting direction for future research is to incorporate multiple target-agents, e.g. hide an object from two humans at the same time, show an object to a group of people, etc. There are two different aspects:

  1. (i)

    Adapting the understanding to perform for multiple target-agents: The symbolic level understanding of tasks will facilitate easy generalization for planning for multiple target-agents, but will depend upon the capability of the task planner to plan and perform the task for multiple target-agents.

  2. (ii)

    Understanding from demonstrations involving multiple target-agents: For understanding a task that is itself aimed at multiple target-agents, the framework can be adapted to build and maintain multiple hypothesis trees for the different target-agents. The underlying research challenge will be merging such hypothesis trees to autonomously arrive at a coherent understanding.

9.7 Generalization for Multiple Target-Objects

In the current implementation, we focused on tasks requiring a single target-object of interest. There are mainly two ways in which multiple target-objects could be important for a task:

  1. (i)

    Without any relation or ordering: For example, giving something to write with, which requires giving a pen and paper. For this, our current framework can be adapted so that the robot reasons about the effects on the different target-objects independently of each other. This might also lead to autonomous reasoning for identifying the target-objects. In the current approach, the name of the target-object is explicitly provided to the robot; work on autonomous learning of task-relevant objects, such as [35], could be adapted for this purpose.

  2. (ii)

    With semantics associated with the relation and ordering between the objects: For example, serving coffee, which requires putting the cup above the plate. This will require integrating the presented approach with work on learning ordering and preferences, such as [52].

9.8 Facilitate Task/Action Recognition and Proactive Behavior

By partially observing the human's action and its effect, the robot could probabilistically classify the task, or the changes desired by the human in the ongoing action. Even when the complete task is known to the robot, based on the desired effects of the task the robot could show proactive behaviors to partially or fully facilitate the task while reducing the human's effort. For example, if the robot infers or knows that the human is trying to reach an object, it could proactively offer help by making the object accessible. Similarly, if the robot knows at the symbolic level that the human wants to show or give an object to it, then, knowing the 'meaning' of the task, the robot could proactively turn its head or move its arm to facilitate achieving the desired effect, as an attempt to guide as well as support the task [50].

9.9 Enriching Human–Robot Natural Interaction

Such symbolic awareness of the task's semantics could also make interaction with the human partner more natural, as the robot will be able to communicate about the task at a level of abstraction understandable by the human.

9.10 Understanding Other Types of Tasks

The focus of the paper is on basic HCOM tasks, which require one agent to perform the task for another agent. We do not claim that with its current knowledgebase the system can learn every manipulation task. However, the presented knowledge and learning framework can help in augmenting the knowledgebase for effect-based understanding of various other types of tasks, such as tap, lift, drop, dump in a trash bin, throw an object, etc. For example, after tapping a ball initially lying static on a support, the final state of the ball is moving on the support; hence, the effect contains object symbolic status: \(on\_support\), motion status: moving. Similarly, for the throw task the effect contains object symbolic status: \(in\_air\), motion status: moving. For the lift task, the object status \(on\_support\) will be lost and the object status \(in\_hand\) will be gained; hence, manipulability is gained, etc. Certainly, more predicates and reasoning about dynamic effects will be required to understand the complete effect of some of these tasks, such as dump, which requires the notion of a "place" or "bin". However, if the domain theory and the knowledgebase are rich enough (either engineered or learnt through some ontological representation), the presented framework could be used to autonomously prune out irrelevant facts, associated either with the human or with the object.

10 Conclusion and Future Work

This paper serves as a step towards making the boundary between task primitives and execution primitives visible, and enables a human-centered robot to understand task semantics independent of the means to achieve it. This is an important aspect of emulation learning, which allows us humans to efficiently perform the same task differently in different situations.

In this paper we focused on tasks that require one agent to perform a task for another agent, e.g. show, hide, give, etc. The main novelty of the paper is that we have argued and demonstrated that including multi-state visuo-spatial perspective taking, i.e. combining reasoning about effort, ability and perspective taking, is a must for a robot to understand such tasks. We have identified the key attributes that enable the robot to reason on various quantitative, comparative and qualitative facts for analyzing the effect of demonstrated tasks. An explanation based learning (EBL) framework, incorporating an m-estimate based consistency measure which introduces the notion of 'experience', has been presented for understanding task semantics from demonstration. All of these equip the robot with the ability to learn and explain its understanding in a 'meaningful' way. We have demonstrated that the robot autonomously learns the intuitive semantics of a set of basic human-centered tasks at appropriate levels of abstraction, and argued that such understanding could be generalized to novel scenarios and heterogeneous agents.

Work in progress includes integrating the learning framework with goal-effect based task planners to reproduce a task in an entirely different way. Further, the presented approach could benefit from various other attributes to understand more complex tasks.

The presented framework currently reasons over the initial and final world states \(\left\langle WI, WF \right\rangle \) to understand a task based on its effect. An interesting direction for future work is to include the action component A and reason over \(\left\langle WI, A, WF \right\rangle \). This will enable the robot to analyze rich information about execution preferences and to learn tasks for which some sub-action primitives are essential parts of understanding the desired effect. Further, it will be interesting to investigate how the framework could be adapted to understand the undesirable effects of a task.