
1 Introduction

The usage of robots has been increasing in industry for the past 50 years [1], especially in repetitive tasks. Recently, industrial robots have been deployed in applications in which they share (part of) their working environment with people. These types of robots are often referred to as Cobots and are equipped with safety systems according to ISO/TS 15066:2016 [2]. Although Cobots are easy to set up and program, their programs are usually written manually. If the position of objects in their workspace changes, which is common when humans also interact with the scene, their program needs to be adjusted. Therefore, to increase flexibility and to facilitate the implementation of robotic automation, the robot should be able to adjust its configuration in order to interact with objects in variable positions.

A robot manipulator consists of a series of joints and links forming the arm; the end-effector is placed at its far end. The purpose of an end-effector is to act on the environment, for example by manipulating objects in the scene. The most common end-effector for grasping is the simple parallel gripper, consisting of a two-jaw design.

Grasping is a difficult task when objects are not always in the same position. Several techniques have been applied to obtain a grasping position on the object. In [3], a vision technique is used to define candidate points on the object and then triangulate a point where the object can be grasped.

With the evolution of processing power, Computer Vision (CV) has also played an important role in industrial automation over the last 30 years, including the processing of depth images [4]. CV has been applied from food inspection [5, 6] to smartphone parts inspection [7]. Red Green Blue Depth (RGBD) cameras combine a sensor capable of acquiring color and depth information and have been used in robotics to increase flexibility and bring new possibilities. Several models are available, e.g. Asus Xtion, Stereolabs ZED, Intel RealSense and the well-known Microsoft Kinect. One approach to grasping different types of objects using RGBD cameras is to create 3D templates of the objects and a database of possible grasping positions. The authors in [8] used a dual Machine Learning (ML) approach: one model to identify familiar objects with spin-images and a second one to recognize an appropriate grasping pose. This work also used interactive object labelling and kinesthetic grasp teaching. The success rate varies according to the number of known objects, ranging from 45% up to 79% [8].

Deep Convolutional Neural Networks (DCNNs) have been used to identify robotic grasp positions in [9]. The approach takes an RGBD image as input and outputs a five-dimensional grasp representation, with position \((x, y)\), grasp rectangle dimensions \((h, w)\) and orientation \(\theta \) of the grasp rectangle with respect to the horizontal axis. Two Residual Neural Networks (ResNets) with 50 layers each are used to analyse the image and generate the features fed to a shallow CNN that estimates the grasp position. The networks are trained on a large dataset of known objects and their grasp positions.

The Generative Grasping Convolutional Neural Network (GG-CNN) proposed in [10] is fast to compute, capable of running in real time at 50 Hz. It uses a DCNN with just 10 to 20 layers to analyse the images and depth information and to control the robot in real time to grasp objects, even when they change position in the scene.

In this paper we investigate the use of Reinforcement Learning (RL) to train an Artificial Intelligence (AI) agent to control a Cobot to perform a given pick-and-place task, estimating the grasping position without previous knowledge about the objects. To enable the agent to execute the task, an RGBD camera is used to generate the inputs for the system. An adaptive learning system was implemented to adapt to new situations such as new configurations of robot manipulators and unexpected changes in the environment.

2 Theoretical Background

In this section we present a summary of relevant concepts used in the development of our system.

2.1 Convolutional Neural Networks

A CNN is a class of algorithms that combines Artificial Neural Networks with convolutional kernels to extract information from a dataset. The convolutional kernel scans the feature space and the result is stored in an array to be used in the next step of the CNN.
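As an illustration of the idea (not the network used in this work), the following minimal PyTorch snippet shows a single convolutional layer scanning a one-channel image and producing a feature map for the next stage:

```python
# Minimal sketch of one convolutional layer: learned 3x3 kernels scan a
# grayscale image and produce a feature map consumed by the next CNN stage.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
image = torch.rand(1, 1, 64, 64)   # batch of one 64x64 single-channel image
features = conv(image)             # kernels scan the image
print(features.shape)              # torch.Size([1, 8, 64, 64])
```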

CNNs have been applied to different machine learning problems, such as object detection, natural language processing, anomaly detection and deep reinforcement learning, among others. The majority of CNN applications are in the computer vision field, with a highlight on object detection and classification algorithms. The next section explores some of these algorithms.

2.2 Object Detection and Classification Algorithms

In the field of artificial intelligence, image processing for object detection and recognition is highly advanced. The increase in Central Processing Unit (CPU) processing power and the increased use of Graphics Processing Units (GPUs) have played an important role in the progress of image processing [11].

The problems of object detection are to detect whether there are objects in the image, to estimate their position in the image and to predict their class. In robotics, the orientation of the object can also be very important to determine the correct grasp position. A set of object detection and recognition algorithms is investigated in this section.

Several feature arrays are extracted from the image and form the base for the next convolution layer, and so on, to refine and reduce the dimensionality of the features; the last step is a classification Artificial Neural Network (ANN), which gives the output as a degree of certainty over a number of classes. See Fig. 1, where a complete CNN is shown.

Fig. 1. CNN complete process: several convolutional layers alternate with pooling and, in the final classification step, a fully connected ANN [12].

The learning process of a CNN consists of determining the values of the kernels to be used during the multiple convolution steps. This process can take hours of processing over a labeled dataset to estimate the best weights for the specific objects. The advantage is that, once the model weights have been determined, they can be stored for future applications.
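A minimal sketch of this reuse in PyTorch (the model architecture and file name below are illustrative, not the ones used in this work):

```python
import torch
import torch.nn as nn

# Toy model standing in for a trained CNN (architecture is illustrative only).
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 16, 3, padding=1))

# Once the weights have been learned, they can be stored...
torch.save(model.state_dict(), "cnn_weights.pth")   # hypothetical file name

# ...and restored later without repeating the hours-long training process.
model.load_state_dict(torch.load("cnn_weights.pth"))
model.eval()
```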

In [13], the Regions with Convolutional Neural Networks (R-CNN) algorithm is proposed to solve the problem of object detection. The principle is to propose around 2000 regions of the image that may contain objects and, for each of them, to extract features and analyze them with a CNN in order to classify the objects in the image.

The problem of R-CNN is the high processing power needed to perform this task. A modern laptop takes about 40 s to analyze a high definition image using this technique, making real time video analysis impossible. It is still usable in applications where time is not critical or where multiple processors are available, since each processor can analyze one proposed region.

An alternative to R-CNN is Fast R-CNN [14], where the features are extracted before the region proposition is done; this saves processing time but loses some potential for parallel processing. The main difference to R-CNN is the single convolutional feature map computed from the whole image.

Fast R-CNN is capable of near real time video analysis on a modern laptop. For real time applications there is a variation of this algorithm, proposed in [15], called Faster R-CNN. It uses the synergy between steps to reduce the number of proposed objects, resulting in an algorithm capable of analyzing an image in 198 ms, sufficient for video analysis. Faster R-CNN achieves an average result of over 70% correct identifications.

Extending Faster R-CNN, Mask R-CNN [16, 17] creates a pixel segmentation around the object, giving more information about its orientation and, in the case of robotics, a first hint of where to pick the object.

There are efforts to use depth images with object detection and recognition algorithms, as shown in [18], where the positioning accuracy of the object is higher than with RGB images.

2.3 Deep Reinforcement Learning

Together with Supervised Learning and Unsupervised Learning, RL forms the base of ML algorithms. RL is the area of ML based on rewards, where the learning process occurs via interaction with the environment. The basic setup includes the agent being trained, the environment, the possible actions the agent can take and the reward the agent receives [19]. The reward can be associated with the action taken or with the new state.

Some problems in RL can be too large for exact solutions and demand approximate ones. The use of deep learning to tackle this problem in combination with RL is called Deep Reinforcement Learning (deep RL). Some problems would require more memory than available, e.g., a Q-table to store all possible solutions for an input color image of 250 \(\times \) 250 pixels would require \( 250\times 250\times 255\times 255\times 255 = 1{,}036{,}335{,}937{,}500\) bytes, or about \({1}\,{\text {TB}}\). For such large problems the exact solution can be prohibitive in terms of memory and processing time.
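As a quick sanity check of the figure above, the byte count can be reproduced directly (one byte per table entry, as assumed in the text):

```python
# One entry per pixel position (250 x 250) and per possible RGB value (255^3).
entries = 250 * 250 * 255 * 255 * 255
print(entries)               # 1036335937500 bytes
print(entries / 1e12, "TB")  # ~1.04 TB
```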

2.4 Deep Q Learning

For large problems, the Q-table can be approximated using ANNs and CNNs to estimate the Q-values. The Deep Q Learning Network (DQN) was proposed in [20] to play Atari games at a high level; later this technique was also used in robotics [21, 22]. A self-balancing robot was controlled using DQN in a simulated environment with better performance than Linear-Quadratic Regulator (LQR) and Fuzzy controllers [23]. Several DQNs have been tested for ultrasound-guided robotic navigation in the human spine to locate the sacrum in [24].
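A generic sketch of the DQN update (not the implementation of [20] nor of this work) helps fix the idea: a network approximates Q(s, a) and is trained towards the temporal-difference target computed from the reward and a target network. All sizes below are illustrative.

```python
import torch
import torch.nn as nn

# Small network approximating Q(s, a); a frozen copy provides the TD target.
n_actions, gamma = 4, 0.99
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One transition (state, action, reward, next_state) from the replay memory.
state, action, reward, next_state = torch.rand(1, 8), 2, 1.0, torch.rand(1, 8)

q_value = q_net(state)[0, action]
with torch.no_grad():
    target = reward + gamma * target_net(next_state).max()
loss = nn.functional.smooth_l1_loss(q_value, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```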

3 Proposed System

The proposed system consists of a collaborative robot equipped with a two-finger gripper and a fixed RGBD camera pointing to the working area. The control architecture was designed considering the use of DQN to estimate the Q-values in the Q-Estimator. RL demands multiple episodes to obtain the necessary experience. Acquiring experience can be accelerated in a simulated environment, which can also be enriched with data not available in the real world. The proposed architecture shown in Fig. 2 was designed to work in both simulated and real environments to allow experimentation on a real robot in the future.

Fig. 2. Proposed architecture for grasp learning, divided into the execution side (left) and the learning side (right). The modules in blue are ROS drivers and the modules in yellow are Python scripts.

The proposed architecture uses Robot Operating System (ROS) topics and services to transmit data between the learning side and the execution side. The boxes shown in blue in Fig. 2 are the ROS drivers, necessary to bring the functionalities of the hardware into the ROS environment. The execution side can be simulated, to easily collect data, or real hardware, for fine tuning and evaluation. As in [22], the action space is defined as motor control and the Q-values correspond to the probability of grasp success.

The chosen policy for the RL algorithm is \( \varepsilon \)-greedy, i.e., the action with maximum expected reward is pursued, with a probability \( \varepsilon \) of taking a random action instead. The R-Estimator estimates the reward based on the success of the grasp and the distance reached to the object, following Eq. 1:

$$\begin{aligned} \mathcal {R}_t = \begin{cases} \frac{1}{d_t + 1}, & \text {if } 0 \le d_t \le 0.02\\ 0, & \text {otherwise} \end{cases} \end{aligned}$$
(1)

where \( d_t \) is the distance between the end-effector and the object, in meters.
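A minimal sketch of the \( \varepsilon \)-greedy policy and of the reward of Eq. 1 follows; the value of \( \varepsilon \) below is illustrative, only the 2 cm threshold and the \( 1/(d_t+1) \) shaping come from the text.

```python
import random
import numpy as np

EPSILON = 0.1   # illustrative value; the actual epsilon is a training hyperparameter

def select_action(q_values, epsilon=EPSILON):
    """Epsilon-greedy policy: random action with probability epsilon,
    otherwise the action with maximum estimated Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

def reward(d_t):
    """Reward of Eq. 1: shaped by the distance d_t (meters) between the
    end-effector and the object, zero beyond 2 cm."""
    return 1.0 / (d_t + 1.0) if 0.0 <= d_t <= 0.02 else 0.0

# Example: a grasp attempt that stops 1 cm from the object.
print(reward(0.01))   # ~0.990
```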

3.1 Action Space

RL gives freedom in choosing the possible actions available to the agent. In this work, actions are defined as the possible positions at which to attempt to grasp an object inside the work area, defined as:

$$\begin{aligned} \mathcal {S}_{a} = \{ v, w \}, \end{aligned}$$
(2)

where \( v \) is the proportional position inside the working area along the x axis and \( w \) is the proportional position along the y axis. The values are discretized by the output of the CNN.
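A possible mapping from a discrete CNN output index back to \((v, w)\) and then to a metric position is sketched below; the grid resolution and workspace bounds are assumed values for illustration only.

```python
import numpy as np

# Assumed discretization: one Q-value per cell of an N x N grid over the
# working area (N and the workspace bounds below are illustrative).
N = 32
X_MIN, X_MAX = 0.20, 0.50    # workspace limits in meters (assumed values)
Y_MIN, Y_MAX = -0.15, 0.15

def action_to_position(action_index):
    """Map a flat action index to proportional (v, w) and metric (x, y)."""
    row, col = divmod(action_index, N)
    v = col / (N - 1)                     # proportional position along x
    w = row / (N - 1)                     # proportional position along y
    x = X_MIN + v * (X_MAX - X_MIN)
    y = Y_MIN + w * (Y_MAX - Y_MIN)
    return (v, w), (x, y)

q_values = np.random.rand(N * N)          # stand-in for the CNN output
best = int(np.argmax(q_values))
print(action_to_position(best))
```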

3.2 Convolutional Neural Network

To estimate the Q-values, a CNN is used. For the action space \(\mathcal {S}_{a}\), the network consists of two blocks to extract features from the images, a concatenation of the features and another CNN to produce the Q-values. The feature extraction blocks are pre-trained PyTorch models whose final classification network is removed. The layer to be removed is different for each model; in general, the fully connected layers are removed. Four models were selected to compose the network: DenseNet, MobileNet, ResNext and MNASNet. The selection criteria considered the feature space and the performance of the models.

Fig. 3. The CNN architecture for the action space \(\mathcal {S}_{a}\): the two main blocks are a simplified representation of the pre-trained DenseNet model [25], with only the feature size represented. The features from the DenseNet model are concatenated and fed to the next CNN; the result is an array of Q-values used to determine the action.

The use of pre-trained PyTorch models reduces the overall training time. However, it brings limitations to the system: the input image must be 224 by 224 pixels and must be normalized with the mean and standard deviation of the original training dataset [26]. In general this limits the working area of the algorithm to an approximately square region (Fig. 3).
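The sketch below illustrates one way such a two-branch Q-Estimator could be assembled from pre-trained torchvision models. The choice of DenseNet-121 for both branches, the replication of the depth image to three channels and the convolutional head are assumptions for illustration; only the 224 by 224 input size and the ImageNet normalization are requirements from [26].

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# ImageNet normalization statistics required by the pre-trained backbones.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# Feature extractors with the classifier removed (downloads ImageNet weights;
# torchvision >= 0.13 API).
rgb_backbone = models.densenet121(weights="DEFAULT").features
depth_backbone = models.densenet121(weights="DEFAULT").features

class QEstimator(nn.Module):
    def __init__(self, grid=32):
        super().__init__()
        self.rgb = rgb_backbone
        self.depth = depth_backbone
        # Illustrative head: concatenated 2x1024 feature maps -> Q-value grid.
        self.head = nn.Sequential(
            nn.Conv2d(2048, 256, 1), nn.ReLU(),
            nn.Upsample(size=(grid, grid), mode="bilinear", align_corners=False),
            nn.Conv2d(256, 1, 1))

    def forward(self, rgb, depth):
        features = torch.cat([self.rgb(rgb), self.depth(depth)], dim=1)
        return self.head(features).flatten(1)   # one Q-value per (v, w) cell

rgb = normalize(torch.rand(1, 3, 224, 224))
depth = torch.rand(1, 1, 224, 224).repeat(1, 3, 1, 1)   # 3-channel depth (assumed)
q_values = QEstimator()(rgb, depth)
print(q_values.shape)    # torch.Size([1, 1024])
```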

3.3 Simulation Environment

The simulation environment was built on Webots, an open-source robotics simulator [27]. This choice was made considering the usability of the software and its use of computational resources [28]. To enclose the simulation in the ROS environment, some modules were implemented: Gripper Control, Camera Control and a Supervisor to control the simulation. The simulated UR3e robot is connected to ROS using the ROS driver provided by the manufacturer and is controlled with the Kinematics module. Figure 4 shows the simulation environment, in which the camera is located in front of the robot, pointing to the working area. A feature of the simulated environment is to have control over all object positions and colors. The positions were used as information for the reward, and the color of the table was changed randomly at each episode to increase robustness during training. For each attempt, the table color, the number of objects and the positions of the objects were randomly changed.

Fig. 4. The virtual environment built on Webots: it consists of a table, a UR3e collaborative robot, a camera and the objects used in the training.

Webots Gripper Control. The Gripper Control is responsible for reading and controlling the positions of the joints of the simulated gripper. It controls all joints, motors and sensors of the simulated gripper. Touch sensors were also added at the tips of the fingers to emulate the feedback signal when an object is grasped.

The Robotiq 2F-85 is the gripper we are going to use in future experiments with the real robot. It consists of 6 rotational joints intertwined to form the 2 fingers. During tests, the simulation of the closed kinematic chain of this gripper in Webots was not stable. To regain stability in simulation we used a gripper with a simpler mechanical structure but dimensions similar to those of the Robotiq 2F-85. The gripper used in simulation is shown in detail in Fig. 5.

Fig. 5. Detail of the gripper used in the simulation: its appearance is based on the Kuka Youbot gripper and its bounding objects are simplified to blocks.

Webots Supervisor. The Supervisor is responsible for resetting the simulation, preparing the positions of the objects at the beginning of each episode, changing the color of the table and publishing the positions of the objects to the reward estimator. To estimate the distance between the center of the end-effector and the objects, a GPS position sensor is placed at the gripper's center to inform the Supervisor of its position. The positions of the objects are used to shape the reward proportionally to the distance between the end-effector and the object. Although this information is not available in the real world, it is used to speed up the training sessions in simulation.

Webots Camera. The camera simulated in Webots has the same resolution as the Intel RealSense camera. To avoid the need of calibrating the depth camera, both the RGB and depth cameras have coincident positions and fields of view in simulation. The field of view is the same as that of the Intel RealSense RGB camera: 69.4\(^\circ \) or 1.21 rad.

3.4 Integrator

The Integrator is responsible for connecting all modules, simulated or real. It controls the Webots simulation using the Supervisor API and feeds the RGBD images to the neural network.

Kinematics Module. The kinematics module controls the UR3e robot, simulated or real. It contains several methods to execute the calculations needed for the movement of the Cobot.

Although RL has been used to solve the kinematics in other works [22, 29], this is not the case in our system. Instead, we make use of the analytical solution of the forward and inverse kinematics of the UR3e [30]. The Denavit-Hartenberg parameters are used to calculate the forward and inverse kinematics of the robot [31]. Considering that the UR3e has 6 joints, the combination of 3 of them can give \(2^3=8\) different configurations resulting in the same end-effector pose (elbow up and down, wrist up and down, shoulder forward and back). On top of that, the UR3e joints have a range from \(- 2\pi \) to \(+ 2\pi \) rad, increasing the possible solution space to \(2^6=64\) different configurations for the same end-effector pose. To reduce the problem, the range of the joints is limited via software to \(- \pi \) to \(+ \pi \) rad, still leaving 8 possible solutions, from which the one nearest to the current position is selected.
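A minimal sketch of this solution selection step is given below; the analytical solver of [30] is not reproduced, and the candidate solutions used in the example are hypothetical.

```python
import numpy as np

JOINT_LIMIT = np.pi   # software limit of +/- pi rad applied to every joint

def select_nearest_solution(ik_solutions, current_joints):
    """Among candidate IK solutions (each a 6-vector of joint angles, e.g. as
    returned by an analytical solver), keep the ones within the software joint
    limits and return the one closest to the current configuration."""
    best, best_dist = None, np.inf
    for q in ik_solutions:
        q = np.asarray(q)
        if np.any(np.abs(q) > JOINT_LIMIT):
            continue                               # outside the limited range
        dist = np.linalg.norm(q - current_joints)  # joint-space distance
        if dist < best_dist:
            best, best_dist = q, dist
    return best

# Illustrative usage with two hypothetical candidate solutions.
current = np.zeros(6)
candidates = [np.array([0.1, -0.5, 0.4, 0.0, 1.2, 0.0]),
              np.array([2.9, -0.5, 0.4, 0.0, 1.2, 0.0])]
print(select_nearest_solution(candidates, current))
```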

The kinematics module is capable of moving the robot to any position in the workspace while avoiding unreachable positions. To increase the usability of the module, functions with the same behavior as the original Universal Robots "MOVEL" and "MOVEJ" commands have been implemented.

To estimate the cobot joint angles needed to position the end-effector in space, the Tool Center Point (TCP) must be considered in the model. The TCP is the position of the end-effector in relation to the robot flange. The real robot that will be used for future experiments has a Robotiq wrist camera and a 2F-85 gripper, which means that the TCP is 175.5 mm from the robot flange along the z axis [32].
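A small sketch of how the TCP offset can be composed with the flange pose using homogeneous transforms follows; the flange pose values are illustrative.

```python
import numpy as np

TCP_OFFSET_Z = 0.1755   # meters, Robotiq wrist camera + 2F-85 gripper stack

def flange_to_tcp(T_flange):
    """Compose the flange pose (4x4 homogeneous transform in the robot base
    frame) with the fixed tool offset along the flange z axis."""
    T_tool = np.eye(4)
    T_tool[2, 3] = TCP_OFFSET_Z
    return T_flange @ T_tool

# Illustrative flange pose: no rotation, 30 cm in front of the base, 40 cm up.
T_flange = np.eye(4)
T_flange[:3, 3] = [0.30, 0.0, 0.40]
print(flange_to_tcp(T_flange)[:3, 3])   # TCP ends up at z = 0.5755 m
```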

4 Results and Discussion

This section presents the results and discussion of two training sessions with different methods. The tests were performed on a laptop with an i7-9750H CPU, 32 GB RAM and a GTX 1650 4 GB GPU, running Ubuntu 18.04. Although the GPU was not used in the CNN training, the simulation environment made use of it.

4.1 Modules

All modules were tested individually to ensure proper functioning. The ROS communication was tested using built-in ROS tools to check the connections between nodes via topics or services. The UR3e joint positions are always published on a topic and controlled via an action client. In the simulation environment, the camera images, the gripper control and the supervisor commands are made available via ROS services. Differently from ROS topics, ROS services only transmit data when queried, decreasing the processing demanded from Webots. Figure 6 shows the nodes connected via topics in the simulated environment; services are not represented in this diagram, which was generated with a built-in ROS tool.
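The snippet below sketches the topic/service distinction with rospy. The /joint_states topic is the one published by the UR ROS driver, while the camera service name and its Trigger type are placeholders, not the actual interfaces of our modules.

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import JointState
from std_srvs.srv import Trigger   # placeholder service type for illustration

def on_joints(msg):
    # Topics push data continuously: this callback fires for every message.
    rospy.loginfo("joint positions: %s", msg.position)

rospy.init_node("module_test")
rospy.Subscriber("/joint_states", JointState, on_joints)

# Services transmit data only when queried, reducing the load on Webots.
rospy.wait_for_service("/camera/get_image")          # hypothetical service name
get_image = rospy.ServiceProxy("/camera/get_image", Trigger)
response = get_image()
rospy.loginfo("service replied: %s", response.success)

rospy.spin()
```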

Fig. 6. Diagram of the nodes running during the testing phase. In the simulated environment most of the data is transmitted via ROS services. In the picture, the topics /follow_joint_trajectory and /gripper_status are responsible for the robot movement and gripper status information exchange, respectively.

CNN. Of the four models tested, DenseNet and ResNext demanded more memory than available on the GPU, while MobileNet and MNASNet were capable of running on the GPU. To keep the evaluation fair, all timing tests were performed on the CPU.

4.2 Training

For training the CNN, a Huber loss function [33] and an Adam optimizer [34] with weight decay regularization [35] were used; the hyperparameters used for the RL and CNN training are shown in Table 1.

Table 1. Hyperparameters used in training.
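A minimal sketch of the training step with this loss and optimizer is shown below; the model, learning rate and weight decay values are placeholders, the real hyperparameters being those listed in Table 1.

```python
import torch
import torch.nn as nn

# Toy stand-in for the Q-Estimator network (illustrative only).
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1024))
criterion = nn.SmoothL1Loss()                       # Huber loss [33]
optimizer = torch.optim.AdamW(model.parameters(),   # Adam with weight decay [34, 35]
                              lr=1e-4, weight_decay=1e-5)

def train_step(image, target_q):
    """One forward/backward pass over a batch (cf. the timings in Table 2)."""
    optimizer.zero_grad()
    predicted_q = model(image)              # forward pass
    loss = criterion(predicted_q, target_q)
    loss.backward()                         # backward pass: gradients
    optimizer.step()                        # weights updated with the learning rate
    return loss.item()

loss = train_step(torch.rand(10, 3, 224, 224), torch.rand(10, 1024))
```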

To avoid a color bias in the algorithm, the color of the simulated table was changed for every episode.

Each training session was divided into four parts: collecting data, deciding the action to take based on the estimated Q-values, taking the action and receiving a reward, and training the CNN. Several training sessions were performed and the experience from previous rounds was used to improve the training process.

The training cycle times are shown in Table 2. Forward is the pass from the input to the output of the CNN; backward is the pass that evaluates the gradients from the difference at the output back to the input. In the backward pass the weights of the network are updated with the learning rate \(\alpha _{CNN}\).

Table 2. Mean time and standard deviation of forward and backward time during training.

First Training Session. In the first training round no previous experience is used and the algorithm learns from scratch. The main goal is to gather information about the training process, such as cycle time, and to acquire experience to be used in future training sessions. The algorithm was trained on the most recent experience with a batch size of 1.

During the training sessions, the accuracy was estimated based on 10 attempts every 10 epochs to verify how well the algorithm was performing at the time. The results are shown in Fig. 7. The training session took from 1:43 to 2:05 h to complete.

In Fig. 7 a training problem is observed where the loss reaches zero and there is no gradient for learning. The algorithm cannot learn and the accuracy shows that the estimated Q-values are poor. Several causes can explain this behavior, including CNN weights that are too small and accumulated experience consisting mostly of failed attempts. The solutions are complex, including fine-tuning hyperparameters and selecting the best experiences for the algorithm, as shown in [36]. Another solution is to use demonstration through shaping [37], where the reward function is used to generate training data based on demonstrations of the correct action to take. The training data for the second session was generated using the reward function to map all possible rewards of the input.
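A hedged sketch of how such shaped training data can be generated when the object positions are known is given below; the grid resolution and workspace bounds are the same illustrative values assumed earlier, not the actual ones.

```python
import numpy as np

# Since the simulation knows the object positions, the reward of Eq. 1 can be
# evaluated for every cell of the action grid and used as the training target.
N = 32
X_MIN, X_MAX = 0.20, 0.50    # assumed workspace bounds in meters
Y_MIN, Y_MAX = -0.15, 0.15

def reward(d):
    return 1.0 / (d + 1.0) if 0.0 <= d <= 0.02 else 0.0

def target_q_map(object_positions):
    """Reward of Eq. 1 evaluated at the center of every (v, w) cell,
    taking the nearest object for each cell."""
    xs = np.linspace(X_MIN, X_MAX, N)
    ys = np.linspace(Y_MIN, Y_MAX, N)
    target = np.zeros((N, N))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            d = min(np.hypot(x - ox, y - oy) for ox, oy in object_positions)
            target[i, j] = reward(d)
    return target

print(target_q_map([(0.35, 0.02)]).max())   # highest target near the object
```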

Fig. 7. The loss and accuracy of the 1000-epoch training session; loss data were smoothed using a third order filter, raw data is shown in light colors.

Fig. 8. The loss and accuracy during the 250-epoch training session; data were smoothed using a third order filter, raw data is shown in light colors.

Second Training Session. The second training session used demonstration through shaping. This was possible because the positions of the objects are available in the simulation environment. The training process received experiences generated from the simulation; these experiences contain the best possible action for each episode.

The batch size used in this training session was 10. The increase in batch size combined with the new experience replay caused a larger loss at the beginning of the training session, as seen in Fig. 8. The training session took from 3:43 to 4:18 h to complete. The accuracy was estimated for every epoch based on 10 attempts.

5 Conclusion

This paper presented the use of RL to train an AI agent to control a collaborative robot to perform a pick-and-place task while estimating the grasping position without previous knowledge about the object. An RGBD camera was used to generate the inputs for the system. An adaptive learning system was implemented to adapt to new situations such as new configurations of robot manipulators and unexpected changes in the environment. The results obtained in simulation validated the proposed approach. As future work, an implementation with a real manipulator will be addressed.