
1 Introduction

Robots of the future should be Smart: versatile and efficient Artificial Intelligence systems able to perform behaviors of growing complexity, adapt to changes, collaborate with humans and other robots, learn from the past and from actions performed by other agents, and build on their capabilities based on that knowledge. Two reasons limit robot intelligence: the limited capacity of on-board data processing and the absence of a common medium to communicate and share knowledge. Tapping into the Cloud is the solution. A Cloud server allows CPU-heavy tasks to be offloaded from robots and intensive computation to be performed while meeting the hard real-time constraints of operations. A Web Ontology enables the definition of a common vocabulary that ensures a common understanding during the interaction as well as efficient data transfer and integration.

In automation, a large number of objects must be manipulated in real time, and failures can lead to high costs. Moving data and computation to the Cloud favors data sharing [1], enables robots to learn the stability of finger contacts from previous manipulations of the same object [2], and allows strategies used on some objects to be applied to similar parts encountered later [3].

This paper proposes an Open Semantic Framework for knowledge acquisition by cognitive robots performing manipulation tasks. An Ontology and a Cloud-based Engine have been implemented. The former stores data about objects and the actions necessary to manipulate them. The latter detects objects in the scene and retrieves their manipulation actions from the Ontology. If no ontological data exists, the Engine generates it and stores it in the Ontology.

The rest of the paper is organized as follows. Section 2 reviews the research done in this context and compares existing works with the one proposed in this paper. Section 3 describes the Ontology design and its implementation details. Section 4 presents the Cloud-based Engine, focusing on the intelligent insertion and retrieval of ontological data. It also highlights how replacing the training phase for object recognition and manipulation with human-robot cooperation makes the growth of robot knowledge gradual. Section 5 demonstrates the performance of the proposed system by discussing the experiments performed. Section 6 presents conclusions and future work.

2 Related Work

Different Cloud Platforms exist. Their inadequacy for robotics scenarios is mainly due to the difference between Web applications and robotics applications. Many Web applications are stateless, single processes that use a request-response model to talk to the client. Robotic applications are stateful, multi-process, and require bidirectional communication with the client. An example of an efficient and widespread Cloud Platform which, however, is not suited for robotics applications is the Google App Engine.Footnote 1 It exposes only a limited subset of program APIs tailored specifically for Web applications, allows only a single process, and does not expose sockets, which are indispensable for robotic middlewares such as the open-source Robot Operating System (ROS) [4]. In order to overcome these limitations, some Cloud Robotics Platforms have been implemented. An example is Rapyuta, the RoboEarth Cloud Engine [5], a platform designed for robots to share data and action experiences with each other. With respect to Rapyuta, the proposed Engine focuses only on sharing robotic manipulation data and actions, but guarantees efficient Cloud data access by adopting a cascade hashing algorithm [6]. Moreover, it avoids data duplication during the insertion of new data by using a novel, powerful interlinking algorithm [7]. The algorithm finds the interlinking pattern of two data sets by applying two machine learning methods: K-medoids [8] and the Version Space [9].

Focusing on the Knowledge Base accessed by the Engine, many existing works are available online. Examples are the Columbia Grasp dataset [10], the KIT object dataset [11], and the Willow Garage Household Objects Database [12]. KnowRob [13], the knowledge base of RoboEarth [14], is the most widespread. These knowledge bases store information about objects in the environment and their grasp poses. The Household Objects Database is a simple SQL database, and the SQL format does not favor the scalability of robotic knowledge. The others are well defined by an Ontology. RoboEarth models objects as 3D colored Point Clouds [15]; the others store objects as triangular meshes. Stored items are of high quality, but each object model consists of several recordings from different points of view, thus requiring either a lot of manual work or expensive scanning equipment.

With respect to the existing Knowledge Bases, the proposed one, named RTASK, is scalable because of the adoption of an Ontology that defines the data. It guarantees intelligent data storage and access because of the type of data saved. Every object is characterized by multiple visual features (2D Images, B-Splines, and Point Clouds). When detecting an object, the recognition process starts with the comparison of the smallest features (i.e., the ones representing the 2D Images) and, if necessary, expands to the others (B-Splines and Point Clouds, in increasing order of size). No onerous manual work is required to store objects from different viewpoints: an object is stored even if there exists only a single registration of one of its views. A human teacher helps robots recognize objects when viewed from other orientations. The teacher exploits the connection between the new viewpoint and other object properties, e.g., name and function. These new features are stored in the Ontology, gradually increasing the robots' knowledge about the object itself. Moreover, the proposed Ontology observes the IEEE standards proposed by the IEEE Robotics and Automation Society (RAS) Ontology for Robotics and Automation (ORA) Working Group (WG) by extending the Knowledge Base it proposed (see Fig. 1) [16,17,18]. Also respecting the standard, Balakirsky et al. [19] proposed a kitting ontology. RTASK generalizes the manipulation concept by introducing the notions of manipulation task and action. This means that any manipulation task can be represented, e.g., grasps and pushes.

Fig. 1 The RTASK extension to the IEEE ontology for robotics and automation

Fig. 2 RTASK: The ontology

3 The Ontology

RTASK formulates a common vocabulary for robotic manipulation. As proposed in [20], it separates the concepts of tasks and task executions. Tasks are abstract entities that describe goals to be reached, while task executions are events composed of the actions performed by robots in order to reach those goals.

3.1 Design

Figure 2 depicts the Ontology design: a Task is assigned to an Agent, e.g., a Robot. It should be executed within a certain time interval and requires the fulfillment of a certain Motion in order to be performed. Manipulation is a sub-class of Task. Several types of manipulation exist, e.g., grasps and pushes. They involve the handling of an Object located at a certain Pose (Position and Orientation) through the execution of a Manipulating action. If the Task is assigned to a Robot, then the Motion is represented by a Robot Action. In detail, the Robot Manipulation Action involves the activation of the robot End Effector. Studies have demonstrated that: (i) placing the arm in front of the object before acting improves actions; (ii) humans typically simplify manipulation tasks by selecting one of only a few different prehensile postures based on the object geometry [21]. In this view, the End Effector is first placed at a Pose p' at a distance d from the Object, with its actuable Joints at a certain Pre-manipulation Posture. Then, the End Effector is placed at a Pose p close to the Object and the effective manipulation Posture is assigned to its Joints.
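To make the pre-manipulation and manipulation concepts concrete, the following is a minimal C++ sketch of how these ontological entities could be mirrored in code; the structure and field names are illustrative assumptions, not the actual RTASK schema.

#include <vector>

// Pose: Position (x, y, z) plus Orientation as a quaternion, as defined in the Ontology.
struct Pose {
  double position[3];
  double orientation[4];
};

// Posture: the configuration of the actuable Joints of the End Effector.
struct Posture {
  std::vector<double> joint_values;
};

// Robot Manipulation Action: the End Effector first reaches pose p' at distance d
// with the Pre-manipulation Posture, then pose p with the manipulation Posture.
struct RobotManipulationAction {
  Pose pre_manipulation_pose;        // p', at safety distance d from the Object
  Posture pre_manipulation_posture;
  Pose manipulation_pose;            // p, close to the Object
  Posture manipulation_posture;      // effective manipulation Posture
  double distance_d;                 // approach distance d
};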

In order to retrieve the manipulation data of an object in the scene, the object should be recognized as an instance previously stored in the Ontology. For this purpose, every Object is characterized by an id, name, function, and the visual features obtained by the Sensors. For every Object, RTASK stores multiple types of visual features: 2D Images, B-Splines, and Point Clouds.

3.2 Implementation Details

RTASK is represented through the union of the Resource Description Framework (RDF)Footnote 2 and the Web Ontology Language (OWL),Footnote 3 namely OWL Full. RDF is used to define the structure of the data; OWL adds semantics to the schema and allows the user to specify relationships among data. OWL Full allows an ontology to augment the meaning of the RDF vocabulary, guaranteeing the maximum expressiveness of OWL and the syntactic freedom of RDF. Moreover, OWL is adopted by the World Wide Web Consortium (W3C)Footnote 4 and is the representation language used by the RAS ORA WG. Protégé is used as the ontology editor.Footnote 5

Queries allow robots to investigate the knowledge base and retrieve existing data. A robot able to query the database has the capability of efficiently and intelligently performing tasks. In our case, a C++ interface lets ROS users query RTASK using SPARQL.Footnote 6 Apache Jena Fuseki is used as the SPARQL server.Footnote 7
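As an illustration, the following minimal C++ sketch shows how such a client could submit a SPARQL query to the Fuseki endpoint over HTTP using libcurl; the endpoint path, the rtask: namespace, and the property names are assumptions used only for this example, not the actual interface of RTASK.

#include <curl/curl.h>
#include <iostream>
#include <string>

// Append the HTTP response body to a std::string.
static size_t append_body(char* ptr, size_t size, size_t nmemb, void* userdata) {
  static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
  return size * nmemb;
}

int main() {
  // Hypothetical Fuseki dataset and vocabulary names.
  const std::string endpoint = "http://localhost:3030/rtask/sparql";
  const std::string query =
      "PREFIX rtask: <http://example.org/rtask#> "
      "SELECT ?action WHERE { ?obj rtask:hasName \"coke_can\" . "
      "                       ?obj rtask:hasManipulationAction ?action . }";

  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL* curl = curl_easy_init();
  if (!curl) return 1;

  // Fuseki accepts the query as a form-encoded POST parameter.
  char* escaped = curl_easy_escape(curl, query.c_str(), static_cast<int>(query.size()));
  const std::string body = "query=" + std::string(escaped);
  curl_free(escaped);

  std::string response;
  struct curl_slist* headers =
      curl_slist_append(nullptr, "Accept: application/sparql-results+json");
  curl_easy_setopt(curl, CURLOPT_URL, endpoint.c_str());
  curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
  curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, append_body);
  curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

  if (curl_easy_perform(curl) == CURLE_OK)
    std::cout << response << std::endl;  // JSON result bindings

  curl_slist_free_all(headers);
  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return 0;
}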

4 The Cloud-Based Engine

The current implementation of the Cloud-based Engine is based on a Cloud-based Object Recognition Engine for robotics (CORE) [22]. The robot runs an internal ROS node that receives the segmented objects (objects are segmented using the functions offered by the Point Cloud Library [23]) and sends them to the Cloud-based Engine. The Engine is composed of another ROS node capable of reading the content of the received messages. The communication is based on the rosbridge interface, which provides a Web Socket channel between the nodes.

4.1 Data Retrieval

4.1.1 Objects Manipulation Data Request

The robot asks the Cloud Server to retrieve the manipulation data of an object. It sends a message containing the type of its gripper and the compressed Point Cloud of the manipulable object. The compressed Point Cloud representation saves space and connection time. The compressed Point Cloud is encoded in order to be transmitted over the Web Socket channel. After the encoding, the whole message is represented as a JavaScript Object Notation (JSON) object. The ROS message encoded in the Web Socket request follows.

figure a
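Since the listing is not reproduced here, the following is a hedged C++ sketch of how such a request could be assembled as a rosbridge JSON message (using the nlohmann/json library); the topic and field names are illustrative assumptions, not the actual message definition of the Engine.

#include <nlohmann/json.hpp>
#include <string>

// Build the JSON string published over the rosbridge Web Socket channel.
std::string build_manipulation_request(const std::string& gripper_type,
                                       const std::string& compressed_cloud_base64) {
  nlohmann::json msg;
  msg["op"] = "publish";                                     // rosbridge operation
  msg["topic"] = "/core/manipulation_request";               // hypothetical topic name
  msg["msg"]["gripper_type"] = gripper_type;                 // e.g. "robotiq_2f_85"
  msg["msg"]["compressed_cloud"] = compressed_cloud_base64;  // encoded compressed Point Cloud
  return msg.dump();
}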

The Server receives the Client request and performs a super-fast search for the object inside RTASK. The search starts from the comparison of the object's 2D Image features (SIFT, Scale Invariant Feature Transform) [24] and, in case of mismatch, ends with the comparison of its Point Cloud features (HOG, Histogram of Oriented Gradients) [25]. The steps of the super-fast search of SIFT features follow (see Fig. 3; a sketch of the extraction step is given after the list).

  1. Decode the message and decompress the Point Cloud;

  2. Convert the Point Cloud to a color image using Open-source Computer Vision (OpenCV) [26] functions;

  3. Extract the image SIFT featuresFootnote 8 and store them in a binary file in a Server folder;

  4. Match the features with the ones stored on the Server;

  5. Query RTASK via SPARQL;

  6. The Server returns a moveit_msgs::Grasp ROS message containing the relative manipulation data.
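The sketch below illustrates steps 2-3 in C++ with OpenCV (version 4.4 or later, where cv::SIFT is part of the main distribution); it assumes the Point Cloud has already been decoded, decompressed, and converted to a color image, and the file layout of the Server features folder is an assumption.

#include <opencv2/opencv.hpp>
#include <fstream>
#include <string>
#include <vector>

// Extract SIFT descriptors from the color image obtained from the Point Cloud
// and store them as a raw binary file in the Server features folder.
void extract_and_store_sift(const cv::Mat& color_image, const std::string& out_path) {
  cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
  std::vector<cv::KeyPoint> keypoints;
  cv::Mat descriptors;                      // one 128-dimensional float row per keypoint
  sift->detectAndCompute(color_image, cv::noArray(), keypoints, descriptors);

  std::ofstream out(out_path, std::ios::binary);
  const int rows = descriptors.rows, cols = descriptors.cols;
  out.write(reinterpret_cast<const char*>(&rows), sizeof(rows));
  out.write(reinterpret_cast<const char*>(&cols), sizeof(cols));
  if (rows > 0)
    out.write(reinterpret_cast<const char*>(descriptors.ptr<float>(0)),
              static_cast<std::streamsize>(rows) * cols * sizeof(float));
}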

Fig. 3 Image processing pipeline of the matching phase

The search is fast because of the novel and super-fast cascade hashing algorithm adopted during the matching phase (Step 4) [6] and because of the way in which features are stored. The algorithm allows constructing a dataset without a learning phase: there is no need to train the hashing function as in other Approximate Nearest Neighbor (ANN) methods. Given the input image, the function returns the names of the features of its most similar images (according to the SIFT parameters); the names are in the form of integer numbers. The same names define the object classes of RTASK. Moreover, the features of the stored objects are precalculated and stored in a Server folder: they are not calculated at every data set access. Thanks to the combination of these two characteristics, only one query at the end of the matching process is needed in order to retrieve the similar objects stored in the Ontology.
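The following simplified C++ sketch conveys the idea of a two-stage hashing match in the spirit of cascade hashing [6]: a short binary code selects a candidate bucket and a longer code ranks the candidates by Hamming distance, so no training phase is needed. The code lengths, the random-projection hashing, and the omission of the final Euclidean verification are simplifications, not the exact algorithm of [6].

#include <bitset>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr int kDim = 128;    // SIFT descriptor dimension
constexpr int kShort = 8;    // coarse code: 2^8 candidate buckets
constexpr int kLong = 64;    // fine code used for Hamming ranking

struct HashedFeature {
  uint64_t long_code;        // precalculated and stored in the Server folder
  int object_class;          // RTASK class name, stored as an integer
};

// Binary code from the signs of fixed random projections (no learning phase).
template <int Bits>
uint64_t binary_code(const float* desc, const std::vector<std::vector<float>>& proj) {
  uint64_t code = 0;
  for (int b = 0; b < Bits; ++b) {
    float dot = 0.f;
    for (int d = 0; d < kDim; ++d) dot += proj[b][d] * desc[d];
    if (dot > 0.f) code |= (uint64_t{1} << b);
  }
  return code;
}

// Returns the class of the most similar stored feature, or -1 on mismatch.
int match_descriptor(const float* query,
                     const std::unordered_map<uint16_t, std::vector<HashedFeature>>& buckets,
                     const std::vector<std::vector<float>>& proj_short,
                     const std::vector<std::vector<float>>& proj_long) {
  const uint16_t bucket = static_cast<uint16_t>(binary_code<kShort>(query, proj_short));
  const uint64_t qcode = binary_code<kLong>(query, proj_long);
  auto it = buckets.find(bucket);
  if (it == buckets.end()) return -1;
  int best_class = -1, best_dist = kLong + 1;
  for (const HashedFeature& f : it->second) {   // Hamming ranking inside the bucket
    const int dist = static_cast<int>(std::bitset<64>(qcode ^ f.long_code).count());
    if (dist < best_dist) { best_dist = dist; best_class = f.object_class; }
  }
  return best_class;                            // used in the single final query to RTASK
}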

4.2 Data Insertion

An example of a Client message aiming to create a new Object instance follows. It is encoded as a JSON string.

figure b

When the Server receives the message, it calculates the relative SIFT features and stores them in the features folder and in RTASK.

figure c
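As an illustration of the ontological part of this step, the SPARQL update below sketches how the new Object instance and the location of its SIFT file could be inserted into RTASK; the rtask: namespace and property names are assumptions (the actual update is the one issued by the Server), and the string would be POSTed to the Fuseki update endpoint in the same way as the query sketch in Section 3.2.

// Hypothetical SPARQL update creating a new Object instance in RTASK.
const char* kInsertObject = R"(
  PREFIX rtask: <http://example.org/rtask#>
  PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  INSERT DATA {
    rtask:object_112  rdf:type           rtask:Object ;
                      rtask:hasId        "112" ;
                      rtask:hasName      "coke_can" ;
                      rtask:hasFunction  "container" ;
                      rtask:hasSIFTFile  "features/112.bin" .
  }
)";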

During the insertion of elements in RTASK, duplication avoidance is desirable. To this end, Algorithm 1 is applied to automate the interlinking process. It was proposed in [7] and finds the interlinking pattern of two data sets by applying two machine learning methods: K-medoids and the Version Space. Although interlinking algorithms require interaction with users for the sake of interlinking precision, the computations needed to compare instances are largely reduced with respect to manual interlinking. As the work-flow of Algorithm 1 shows, when interlinking two instances across two data sets D and D', the algorithm first computes the property/relation correspondences across the two data sets (line 5). Then, the instance property values are compared by referring to the correspondences (line 10). A similarity value v is generated from all the similarities of the property values (line 11). If this similarity is equal to or larger than a predefined threshold T, the two compared instances are used to build a link with the relation owl:sameAs (lines 12-14).
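A compact C++ sketch of this work-flow follows; the instance representation, the similarity function, and the averaging used to obtain v are illustrative assumptions, while the thresholding against T and the owl:sameAs links follow the description of Algorithm 1 above.

#include <map>
#include <string>
#include <utility>
#include <vector>

struct Correspondence { std::string prop_d, prop_d_prime; };          // computed once (line 5)
struct Instance { std::string uri; std::map<std::string, std::string> properties; };

// Interlink two data sets D and D': returns the owl:sameAs candidate pairs.
std::vector<std::pair<std::string, std::string>> interlink(
    const std::vector<Instance>& D, const std::vector<Instance>& D_prime,
    const std::vector<Correspondence>& corr, double T,
    double (*similarity)(const std::string&, const std::string&)) {
  std::vector<std::pair<std::string, std::string>> same_as;
  for (const Instance& a : D) {
    for (const Instance& b : D_prime) {
      double sum = 0.0;
      int compared = 0;
      for (const Correspondence& c : corr) {                          // compare values (line 10)
        auto ia = a.properties.find(c.prop_d);
        auto ib = b.properties.find(c.prop_d_prime);
        if (ia != a.properties.end() && ib != b.properties.end()) {
          sum += similarity(ia->second, ib->second);
          ++compared;
        }
      }
      const double v = compared ? sum / compared : 0.0;               // aggregate v (line 11)
      if (v >= T) same_as.emplace_back(a.uri, b.uri);                 // owl:sameAs (lines 12-14)
    }
  }
  return same_as;
}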

Fig. 4 The reasoning pipeline. The blue parts represent human interventions

4.3 Human-Robot Interaction

Usually, onerous a-priori human manual work is required to store object visual features in a Robotics Knowledge Base: every object is represented by a large number of registrations from different points of view. The object representation is accurate, but the a-priori work is onerous. Our approach eliminates this prerequisite by introducing a human teacher that supports robots during their learning phase. The first registrations will not be accurate, but knowledge will gradually increase until it becomes complete, giving robots autonomy.

Figure 4 highlights the reasoning behind this cooperation approach. Robots require human intervention to confirm the identity of a recognized object or to assign one to a new object. Moreover, humans help robots connect the visual features of objects already stored in the Knowledge Base but seen from other points of view. To perform the connection, other object properties are exploited, e.g., name and function.

This approach introduces many advantages. For example, it eliminates the a-priori work currently done by humans by introducing a cognitive and social robot able to interact with other agents, e.g., human operators. A key outcome follows: a robot capable of learning is a flexible, adaptable, and scalable cyber-physical system.

The message used for human-robot interaction follows.

figure d

The human operator shows an object to the robot and gives it a description (e.g., coke, can, pen). The Server seeks a similar object by filtering the Ontology through the object description. If matches exist, the Server returns a message containing the classes of the similar objects found:

figure e

a human feedback then confirms the object class and the new visual features are added to it. Otherwise, a new instance is inserted in RTASK:

figure f

4.4 Manipulation Data Generation

Given a new object, the Cloud-based manipulation planner generates a list of possible manipulations, each consisting of a gripper pose relative to the object itself. Manipulations consist of both grasps and pushes. The current version of the generator aligns the hand with the object's principal axes, starting from either the top or the side of the object, and tries to manipulate it around its Center Of Mass through a trial-and-error Reinforcement Learning technique. As in GraspIt! [27] and the MoveIt! Simple Grasps tool developed by Dave T. Coleman,Footnote 9 given the safety distance d at which the gripper must be positioned before performing the manipulation, the generator returns:

  • a pre-manipulation configuration: the gripper pose and joints configuration at the safety distance;

  • a manipulation configuration: the gripper pose and joints configuration to be maintained during the manipulation.

Experiments proposed by Dave T. Coleman demonstrate that the grasping tool he developed does not fail because, in the scene:

  • the block to be grasped is known a priori;

  • the configuration that the gripper has to maintain during the grasp is manually predetermined according to the block’s dimensions;

  • collision checking is not performed to verify the feasibility of grasps.

If we reason about arbitrary shapes, pre-selecting the manipulation configurations can induce collisions or contact losses. To overcome this problem, the generator used by the Engine exploits the benefits of Reinforcement Learning and generates grasps and pushes according to the input object.
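The following simplified C++ sketch illustrates such a trial-and-error loop: candidate gripper configurations aligned with the object principal axes (top and side approaches) are tried in simulation, the success of each attempt acts as the reward, and the best-valued configuration is returned. The candidate generation, the epsilon-greedy selection, and the simulation hook are assumptions, not the exact Reinforcement Learning technique used by the Engine.

#include <cstddef>
#include <random>
#include <vector>

struct GripperConfiguration {
  double pre_pose[7], pose[7];   // position + quaternion before and at the object
  std::vector<double> joints;    // gripper joint values
  double value = 0.0;            // running estimate of the success probability
  int trials = 0;
};

// Assumed hook: executes one attempt in simulation and reports success.
bool simulate_manipulation(const GripperConfiguration& c);

// candidates must be non-empty (e.g. top and side approaches along the principal axes).
GripperConfiguration select_manipulation(std::vector<GripperConfiguration> candidates,
                                         int budget) {
  std::mt19937 rng(0);
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  for (int t = 0; t < budget; ++t) {
    // Epsilon-greedy: mostly exploit the best estimate, sometimes explore.
    std::size_t idx = 0;
    if (coin(rng) < 0.2) {
      idx = rng() % candidates.size();
    } else {
      for (std::size_t i = 1; i < candidates.size(); ++i)
        if (candidates[i].value > candidates[idx].value) idx = i;
    }
    const double reward = simulate_manipulation(candidates[idx]) ? 1.0 : 0.0;
    GripperConfiguration& c = candidates[idx];
    c.trials += 1;
    c.value += (reward - c.value) / c.trials;   // incremental average of the reward
  }
  std::size_t best = 0;
  for (std::size_t i = 1; i < candidates.size(); ++i)
    if (candidates[i].value > candidates[best].value) best = i;
  return candidates[best];   // pre-manipulation and manipulation configuration
}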

5 Experiments

The experiments aim to provide realistic timing results on Cloud access and ontological data retrieval. For this purpose, RTASK and the Engine have been integrated into the Cloud environment of CORE, which is available on the Wisconsin CloudLab clusterFootnote 10 under the project "core-robotics". It consists of one x86 node running Ubuntu 14.04 with ROS Indigo installed. Moreover, RTASK has been integrated with the Object Segmentation Database (OSD)Footnote 11: a 726 MB data set currently containing 111 different objects, all characterized by a 2D color Image and a Point Cloud.

In simulation, a Husky mobile robot equipped with a Universal Robots UR5 manipulator and a Robotiq 2-finger gripper has to solve a tabletop object manipulation problem: it has to grasp the nearest object located on the table in front of it. A Microsoft Kinect acquires the scene. Gazebo [28] is used as the simulator. A ROS Sense-Model-Act framework has been implemented in order to give the robot the ability to detect the objects on the table, access the Cloud server, and retrieve the manipulation data (see Fig. 5). Implementation details follow.

Fig. 5 The sense-model-act framework

Sense The Kinect acquires the RGB-D images of the environment. From the collected data, a compressed segmented 3D Point Cloud of each individual manipulable object is computed using PCL.

Model On the Cloud, from the compressed Point Cloud, the relative 2D image is computed together with its B-Spline [29] representation. The relative visual features are then computed, e.g., SIFTs and HOGs. The cascade hashing algorithm searches for a match between the features of the segmented object and those saved in the data set. The comparison starts from the SIFTs and, if necessary, expands to the B-Splines (by computing the squared distance among points) and the HOGs. The reasoner accesses RTASK in order to retrieve the manipulation actions relative to the detected features. If a match exists, the information relative to the assigned manipulation action (e.g., push or grasp) and to the relative gripper joint configuration is returned. Otherwise, the Manipulation Data Generator computes the necessary poses.

Act The module lets the robot move by activating its simulated motors. MoveIt!Footnote 12 generates the kinematic information required for the system to pass from the current configuration to the goal configuration. During the motion, the information acquired by the sensors is used to compare the final state of the system with the expected one. In case of mismatch, a trial-and-error routine corrects the joint configuration. The configuration that allows the task to be achieved is saved in RTASK.
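The skeleton below summarizes the resulting control flow in C++; every function is a hypothetical placeholder for the components described above (Kinect acquisition and PCL segmentation, Cloud matching and RTASK query, Manipulation Data Generator, MoveIt! execution), not the actual ROS node interfaces of the framework.

struct SegmentedObject { /* compressed Point Cloud of one manipulable object */ };
struct ManipulationData { /* gripper poses and joint configurations */ };

// Hypothetical placeholders for the Sense, Model, and Act components.
SegmentedObject sense();                                         // Kinect + PCL segmentation
bool model(const SegmentedObject& obj, ManipulationData* data);  // Cloud matching + RTASK query
ManipulationData generate(const SegmentedObject& obj);           // Manipulation Data Generator
bool act(const ManipulationData& data);                          // MoveIt! planning and execution
void store_in_rtask(const ManipulationData& data);

void sense_model_act_once() {
  const SegmentedObject object = sense();
  ManipulationData data;
  if (!model(object, &data))       // no match in RTASK: generate the poses on the Cloud
    data = generate(object);
  if (act(data))                   // the configuration that achieves the task
    store_in_rtask(data);          // is saved back into RTASK
}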

5.1 Results

Figure 6 shows the system in operation: the robot detects the surrounding environment, segments the objects in front of it, and grasps the nearest object using the manipulation configuration retrieved from the Cloud.

Tables 1 and 2 report the most significant timing data. Table 1 reports the time taken to extract the descriptors of the 2D images stored in the data set. The table shows two types of extraction: the extraction of the descriptors of all the 111 images contained in the data set and that of the descriptors of a single image. For each type of extraction, tests were run on 1, 2, 3, and 4 threads, respectively. Note in particular the 0.387422 s needed to extract the descriptors of a single image using 4 threads. Table 2 focuses on the time taken for feature matching. By using 300 inliers, the match is accurate and the computational time does not affect the real-time constraints of a robotic manipulation.

The reported computational times consider SIFT features. The times gradually increase if B-Splines or HOGs are considered. Moreover, the computational effort depends on the richness of the features: the more complex the Point Cloud is, the greater the extraction time will be. Segmentation helps keep the computational times low.

During the experiments, MoveIt! took on average 3.765 s to find a feasible inverse kinematics solution (best case: 0.162757 s; worst case: 7.367439 s; number of trials: 100) on a Dell machine with an Intel Core i7-4470 CPU @ 3.40 GHz x 8, 15.6 GiB of memory, and a 970 GB disk. In the best case, it completed the planning after 0.8486 s. The reported data prove that the intelligent and efficient structure of the proposed Open Semantic Framework does not adversely affect the time required to complete a manipulation: executing the feature extraction and matching on 4 threads off-loads the robots and increases the system performance.

Fig. 6 a The robot in the scene; b The detected scene; c The segmented objects; d The grasp of the nearest object

Table 1 Processing time taken for feature extraction
Table 2 Processing time taken for feature matching

6 Conclusion and Future Work

This paper presented an Open Semantic Framework able to increase robots' knowledge and capabilities in object manipulation. The Framework is composed of an OWL Ontology and a Cloud-based Engine. From the study of human actions when handling objects, the Ontology formulates a common vocabulary that encodes the robotic manipulation domain. The Engine, instead, was developed in order to transfer the computation to the Cloud: it off-loads the robot CPUs and speeds up the robots' learning phase. Given an object in the scene, the Engine retrieves its visual features and accesses the Ontology in order to extract the corresponding manipulation action. If no information is stored, a Reinforcement Learning technique is used to generate the gripper manipulation poses, which are then stored in the Ontology. The Ontology respects the IEEE Standard by extending the existing CORA. The Engine minimizes visual data processing through intelligent ontological data access and retrieval. During ontological data retrieval, a cascade hashing algorithm is adopted in order to optimize the comparison between saved and new visual features. In order to avoid data duplication during the insertion of new instances in the Ontology, a novel efficient interlinking algorithm has been adopted. Furthermore, the training for object recognition and manipulation is replaced by human-robot interaction.

We proved the efficiency and effectiveness of the proposed approach by building a ROS Sense-Model-Act framework able to associate manipulation actions with the features of the objects in the scene. Tests were performed in simulation.

As future work, we aim to extend the proposed Ontology by defining other tasks and actions; e.g., we would like to explore the Navigation domain. We aim to assign robots the new tasks and execute the corresponding new actions in order to increase the capabilities of the Cloud-based Engine. Moreover, we are developing a new Reinforcement Learning technique for the generation of the manipulation configurations.