
1 Introduction

Along with the rapid development of Industry 4.0, the field of robotics has grown quickly and continuously, with applications across many domains and uses: underwater robots, ground robots, aerial robots, space robots, and more. Specific types of robots that can reduce human workload include mobile robots, robot manipulators, bio-inspired robots, and personal robots. In automated and smart factories, robotic arms are widely used to assist people in holding, lifting, and moving objects, increasing work productivity and supporting workers in toxic and dangerous environments.

Robotic systems can be controlled manually by buttons, handles, and similar devices, or controlled by software using microcontrollers. A study in [1] focuses on the design of a mobile robot equipped with a robotic arm using a microcontroller and wireless communication. Another study in [2] outlines the design and control of a two-armed robot with seven degrees of freedom (DOF). In [3], a study explores the design of a 4-DOF robotic arm capable of performing individual tasks such as grasping, lifting, placing, and releasing objects. The study in [4] presented 3-D object recognition and pose estimation for random garbage selection using a partition viewpoint feature histogram. More recently, the study in [5] presented a 3-D object pose estimation method for bin-picking that combines the semantic point pair feature method with the Mask R-CNN algorithm.

Building upon previous studies, the objective of this study is to design and control a robotic arm with six degrees of freedom, capable of picking up 3-D objects by integrating 3-D image processing, voice recognition, and deep learning technology.

2 System Overview

Figure 1 shows the object-picking robot system. The system includes a robotic arm with a gripper, a RealSense camera, and a Windows Forms control interface. The RealSense camera acquires RGB-D images, which are used for object detection with deep learning. The camera also captures 3-D point clouds representing the various objects in the scene. The developed 3-D object recognition and localization algorithm accurately determines the position and orientation of the target object. Voice control commands are captured through the laptop's audio acquisition system. Using these parameters, the robot recognizes and picks up randomly placed objects on request.
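To make the acquisition step concrete, the following is a minimal sketch of RGB-D capture with Intel's pyrealsense2 library. The stream resolutions, frame rate, and variable names are illustrative assumptions, not values reported in this paper.

```python
# Minimal RGB-D acquisition sketch using pyrealsense2 (Intel RealSense SDK).
# Resolutions, frame rate, and variable names are illustrative assumptions.
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

try:
    frames = pipeline.wait_for_frames()
    # Align the depth frame to the color frame so pixels correspond 1:1.
    aligned = rs.align(rs.stream.color).process(frames)
    depth_frame = aligned.get_depth_frame()
    color_frame = aligned.get_color_frame()
    color_image = np.asanyarray(color_frame.get_data())  # H x W x 3, BGR
    depth_image = np.asanyarray(depth_frame.get_data())  # H x W, uint16 depth units
finally:
    pipeline.stop()
```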

Fig. 1 The developed robotic bin-picking system: side view of the robotic arm positioned above the work table with the target objects

3 3-D Object Recognition and Segmentation

Random objects with different properties are placed on the table. The goal of this study is to identify each individual object by predicting object features and segmenting the input point cloud. Each object is represented by a name label and a point cloud. The name label and the point cloud are associated with each other, and the final object is represented by an oriented bounding box. Finally, the object's name label, location coordinates, and orientation are determined so that the robot can pick up the requested object. The flowchart of the proposed method is shown in Fig. 2, and a high-level sketch of the pipeline is given below.
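The following Python sketch outlines how the stages of Fig. 2 fit together. All helper names are hypothetical placeholders for the methods of Sects. 3.1 through 3.4, not an API from this work.

```python
# High-level sketch of the recognition pipeline in Fig. 2. The helper
# functions (recognize_command, detect_objects, extract_object_cloud,
# estimate_pose) are hypothetical names for the steps described in
# Sects. 3.1-3.4.
def pick_requested_object(color_image, depth_image, audio):
    target_name = recognize_command(audio)        # Sect. 3.1: voice -> object name
    detections = detect_objects(color_image)      # Sect. 3.2: Yolo-V3 labels + boxes
    for label, box in detections:
        if label == target_name:
            cloud = extract_object_cloud(box, depth_image)  # Sect. 3.3
            position, orientation = estimate_pose(cloud)    # Sect. 3.4: ICP vs. CAD model
            return target_name, position, orientation
    return None  # requested object not found in the scene
```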

Fig. 2 Flowchart of the proposed method: voice recognition supplies the target object's name, the RGB-D camera supplies color and depth images, Yolo-V3 detects objects in the color image, and the detected objects are combined with the depth data and matched against CAD models for 3-D pose estimation

3.1 Voice Recognition

The voice recognition system allows humans to interact with the robot more flexibly. In this study, Microsoft's speech API [6] was used to simplify speech recognition and to issue commands for the robot to perform.

A simple system consists of two components: speech synthesis and speech recognition. Speech synthesis is the process of generating sound or speech with a computer. Speech recognition is the reverse process, in which the computer interprets the received audio and identifies words and phrases. Voice commands from a predefined set are recognized and mapped to tasks for the robot to perform.
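As an illustration of recognizing commands from a predefined set, here is a minimal Python sketch. The paper uses Microsoft's speech API; the speech_recognition package below is a stand-in, and the command vocabulary is an assumed example.

```python
# Minimal command-recognition sketch. The paper uses Microsoft's speech
# API; the Python speech_recognition package is used here as a stand-in,
# and the command vocabulary is an illustrative assumption.
import speech_recognition as sr

COMMANDS = {"pick up the eraser", "pick up the whitener"}  # predefined phrases

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # compensate for background noise
    audio = recognizer.listen(source)

try:
    text = recognizer.recognize_google(audio).lower()
    if text in COMMANDS:
        print("Recognized command:", text)
    else:
        print("Not a predefined command:", text)
except sr.UnknownValueError:
    print("Speech could not be recognized")
```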

3.2 2-D Object Detection

Object detection is one of the fundamental and important tasks of machine learning. Many algorithms effectively support the detection and classification of objects of interest, such as deep learning algorithms based on convolutional neural networks (Fast R-CNN, Faster R-CNN, Mask R-CNN, etc.) and regression-based algorithms such as Yolo, which predict object classes and bounding boxes quickly enough for real-time recognition. Convolutional-network-based deep learning algorithms are preferred in many machine learning models, for example, the multi-layered fruit classification model using robot vision and a Faster R-CNN network by Wan and Goudos [7], and the segmentation and damage-detection model for cars using Mask R-CNN by Zhang et al. [8]. These algorithms offer very good recognition performance and high accuracy. However, in the model we built, the chosen algorithm is Yolo-V3 because of its fast detection speed, which allows it to run in real time. The effectiveness of this algorithm has been demonstrated in recent publications such as the tomato recognition model of Liu et al. [9] and the real-time face recognition model of Chen et al. [10].

The Yolo-V3 model leverages the Darknet-53 network architecture, comprising 53 convolutional layers and 5 max-pooling layers. To mitigate overfitting, batch normalization and dropout operations are applied after each convolutional layer. The Darknet-53 architecture features five residual blocks, incorporating the concept of residual neural networks. The network diagram of Darknet-53 is depicted in Fig. 3, and the overall structure of the Yolo-V3 network is shown in Fig. 4.
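As a sketch of how such a detector can be run, Yolo-V3 Darknet weights can be loaded through OpenCV's DNN module. The file names, the 416 × 416 input size, and the confidence threshold below are assumptions for illustration, not settings reported in this paper.

```python
# Yolo-V3 inference sketch using OpenCV's DNN module. The file names,
# input size, and threshold are illustrative assumptions.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect(image, conf_threshold=0.5):
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, scalefactor=1 / 255.0, size=(416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(layer_names)

    boxes = []
    for output in outputs:        # one output per Yolo detection scale
        for det in output:        # det = [cx, cy, bw, bh, objectness, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > conf_threshold:
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                x, y = int(cx - bw / 2), int(cy - bh / 2)
                boxes.append((class_id, confidence, (x, y, int(bw), int(bh))))
    # Non-maximum suppression is omitted for brevity.
    return boxes
```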

Fig. 3 Darknet-53 structure

Fig. 4 Yolo-V3 network structure (DBL = convolution, batch normalization, and leaky ReLU)

3.3 Point Cloud Segmentation

The 3-D point cloud obtained from the camera contains information about the various objects in the scene. The input point cloud is split into smaller point clouds, each containing the information that distinguishes an individual object, for later use. Other published approaches include VoxelNet, a LiDAR-based 3-D object detection network [11], and the three-stage point cloud segmentation introduced in [12]. The idea proposed in this study is to combine the objects detected in the 2-D color image with the depth map to extract the target objects. The extracted point clouds are then filtered to remove noise and unnecessary data. The result of the point cloud segmentation is shown in Fig. 5.
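A minimal sketch of this combination step is given below, assuming a pinhole camera model for back-projection and Open3D's statistical outlier filter for noise removal; the intrinsics and filter parameters are illustrative assumptions.

```python
# Sketch of extracting an object's point cloud from a detected 2-D box
# plus the aligned depth map. Camera intrinsics (fx, fy, cx, cy), the
# depth scale, and the filter parameters are illustrative assumptions.
import numpy as np
import open3d as o3d

def segment_object(depth_image, box, fx, fy, cx, cy, depth_scale=0.001):
    x, y, w, h = box  # 2-D bounding box from Yolo-V3
    crop = depth_image[y:y + h, x:x + w].astype(np.float32) * depth_scale

    # Back-project each valid depth pixel to a 3-D point (pinhole model).
    vs, us = np.nonzero(crop > 0)
    zs = crop[vs, us]
    xs = (us + x - cx) * zs / fx
    ys = (vs + y - cy) * zs / fy
    points = np.column_stack([xs, ys, zs])

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    # Remove noise and stray background points.
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    return pcd
```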

Fig. 5 Point cloud segmentation: the raw scene point cloud (left) and the extracted, filtered object point cloud (right)

3.4 Object Localization and Pose Estimation

To estimate the 3-D position and orientation of the target object in the scene point cloud, the extracted target point cloud is aligned with the CAD model using the Iterative Closest Point (ICP) algorithm [13]. The ICP algorithm computes the transformation that refines the initial estimated 3-D pose by minimizing the fitting deviation between the two point clouds: at each iteration, it revises the transformation to reduce the distance between corresponding points of the two scans. After object recognition and localization, the target object is picked by a parallel-jaw gripper, as depicted in Fig. 6.
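A minimal sketch of this refinement step, using Open3D's ICP registration, is shown below. The correspondence distance and identity initialization are assumptions; in practice the initial pose would come from the earlier detection stage.

```python
# ICP refinement sketch with Open3D. The correspondence distance and the
# identity initialization are illustrative assumptions.
import numpy as np
import open3d as o3d

def estimate_pose(object_cloud, cad_model_cloud, init=np.eye(4)):
    result = o3d.pipelines.registration.registration_icp(
        object_cloud,                       # source: segmented scene points
        cad_model_cloud,                    # target: points sampled from the CAD model
        max_correspondence_distance=0.01,   # 1 cm search radius (assumed)
        init=init,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
    )
    return result.transformation            # 4 x 4 pose aligning source to target
```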

Fig. 6 The target object is picked by the robot arm

4 Experimental Results

The experiment was tested with many kinds of objects at random positions and orientations. The minimum object size is 0.01 × 0.01 × 0.01 m. The calibration results are highly accurate, so the robot can be controlled to pick up the correct objects as required. The data was processed on a computer with a Core i7 processor (2.8 GHz, 8 GB RAM). The average processing time over 20 experiments is about 1.5 s for object localization; with a GPU, processing is much faster (about 10 times faster than with the CPU). Through voice commands, the robot successfully locates the 3-D object that coincides with the required mask and carries out the picking task. In some cases, due to environmental noise and pronunciation, the robot may fail to recognize the voice commands.

5 Conclusion

This research has achieved the following objectives:

  1. Designing a robot arm model that can detect, grab, and move objects to the required area.

  2. Successfully identifying and locating objects with the camera and the 3-D image processing algorithm, controlling the robot through voice commands, and recognizing objects using deep learning technology.