
1 Introduction

In recent years, Japan has faced the social problem of a working population that has been declining since its peak in 2008, driven by rapid aging and a low birthrate [1]. This has led to a shrinking economy and a decline in international competitiveness, so it is necessary to create more added value with a limited labor force. Industrial robots are an effective solution to this problem, since they can perform tasks in place of humans, stabilize production efficiency, operate for long hours, and improve quality by preventing human error [2, 3]. As a result, the demand for industrial robots has been growing and is expected to continue to grow: the Japanese domestic robotics market is projected to reach approximately 5.3 trillion yen by 2025 and 9.7 trillion yen by 2035 [4, 5]. However, conventional industrial robots require predefined movements, which is an obstacle to their widespread use. To overcome this difficulty, the use of deep learning to make industrial robots intelligent has been attracting attention, and many network models for grasp position estimation by deep learning have been proposed. Early studies of grasp position estimation mainly used rule-based methods [6, 7] and object detection-based methods [8, 9]. Subsequently, generative convolutional neural networks [10, 11] were applied and succeeded in making the models lighter. Furthermore, because a suitable grasping posture can be predicted from features extracted pixel by pixel, these methods are better suited to grasping tasks [12, 13, 14]. Recently, GR-ConvNet [15] achieved state-of-the-art grasp detection accuracy by introducing a residual structure [16] into the network model.

2 Method

In this chapter, we adopt a representation of the grasping position similar to the one used in GR-ConvNet, which allows the grasping position to be detected accurately, and then introduce an improved architecture. In the bottleneck, we use a composite module that combines the ResNet block and the SE block in parallel, and we introduce the scSE block into the decoder. In the output layer, we build a network that predicts grasp quality, angle, and width separately from the extracted features, without using Dropout.

2.1 Formulation of Grasping Position

The position of a gripper in 3D space is expressed as

$$ G_r = \left( P, \theta_r, W_r, Q \right) $$
(1)

where P is the center position of the gripper tip, \({\theta }_{r}\) is the rotation of the gripper, \({W}_{r}\) is the gripper width, and Q is the probability of a successful grasp. The position of the grasp in the image is expressed as

$$ G_i = \left( {x,y,\theta_i ,W_i ,Q} \right) $$
(2)

where (x, y) is the center point of the object in the image, \({\theta }_{i}\) is the rotation angle, \({W}_{i}\) is the width of the gripper in the image, and Q is the probability of a successful grasp. The transformation of the grasping position information from the image to the camera space and then to the world coordinates is expressed as

$$ G_r = T_{rc} \left( T_{ic} \left( G_i \right) \right) $$
(3)
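As a minimal illustration of Eq. (3), the following Python sketch back-projects the grasp center from the image into camera coordinates and then transforms it into robot coordinates. The intrinsic matrix K, the depth value z, and the extrinsic transform T_rc are hypothetical placeholders rather than values from this work; the grasp angle and width would be converted analogously (rotating the angle into the camera frame and scaling the pixel width by depth), which is omitted here.

```python
import numpy as np

def image_to_camera(x, y, z, K):
    """Back-project an image point (x, y) with depth z into camera coordinates (T_ic)."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    X = (x - cx) * z / fx
    Y = (y - cy) * z / fy
    return np.array([X, Y, z, 1.0])  # homogeneous camera-space point

def camera_to_world(p_cam, T_rc):
    """Transform a homogeneous camera-space point into robot/world coordinates (T_rc)."""
    return (T_rc @ p_cam)[:3]

# Hypothetical example values (not from the paper).
K = np.array([[615.0, 0.0, 320.0],
              [0.0, 615.0, 240.0],
              [0.0, 0.0, 1.0]])        # camera intrinsics
T_rc = np.eye(4)                        # camera-to-robot extrinsics
x, y, z = 300, 220, 0.45                # pixel center and depth [m]
grasp_center_world = camera_to_world(image_to_camera(x, y, z, K), T_rc)
```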

The grasping positions of multiple objects are expressed as

$$ G = \left( \theta, W, Q \right) \in \mathbb{R}^{n \times h \times w} $$
(4)

In Eq. (4), Q is a quality map representing the probability of a successful grasp at each pixel, θ is an angle map representing the rotation angle of the end-effector, and W is a width map representing the width of the end-effector.
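As a sketch of how a single grasp can be recovered from the three maps in Eq. (4), one common approach, assumed here with hypothetical map names and sizes, is to select the pixel with the highest quality score and read the angle and width at that pixel:

```python
import numpy as np

def decode_best_grasp(q_map, angle_map, width_map):
    """Pick the pixel with the highest quality and read the angle and width there."""
    y, x = np.unravel_index(np.argmax(q_map), q_map.shape)
    return x, y, angle_map[y, x], width_map[y, x], q_map[y, x]

# Hypothetical 224x224 maps standing in for the network output.
h, w = 224, 224
q_map = np.random.rand(h, w)
angle_map = (np.random.rand(h, w) - 0.5) * np.pi   # angles in [-pi/2, pi/2]
width_map = np.random.rand(h, w) * 100.0           # widths in pixels
x, y, theta_i, W_i, Q = decode_best_grasp(q_map, angle_map, width_map)
```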

2.2 Proposed Network Model

Figure 1 shows the proposed network architecture, which we constructed with reference to GR-ConvNet. The E block and D block have the same structure as in the conventional model and are shown at the bottom of Fig. 1. The B block, F block, and output layer are described in the following sections.

Fig. 1. Network model of the proposed method

Bottleneck Module:

The bottleneck module shown in Fig. 1 uses the block shown in Fig. 2, which combines SE blocks with ResNet to emphasize pixel-level information by weighting each channel with a sigmoid function. In [17], the optimal placement of the SE block within the residual block [16] is investigated; among the tested patterns, this paper adopts the SE-Identity block [17]. The SE block has also been applied to ResNet-50 [18] and Inception-ResNet [19], achieving the top score in the ILSVRC 2017 competition.

Fig. 2. Improved bottleneck block, combining ResNet with the SE block in parallel.
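A minimal PyTorch sketch of the bottleneck block in Fig. 2 is given below; it combines a residual branch with an SE block applied to the identity path in parallel, in the spirit of the SE-Identity arrangement [17]. The channel counts, kernel sizes, and reduction ratio are assumptions and not the exact configuration of the proposed network.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: per-channel weights from global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                      # channel-wise reweighting

class SEIdentityBottleneck(nn.Module):
    """Residual branch and SE-weighted identity branch added in parallel."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.se = SEBlock(channels)

    def forward(self, x):
        # SE block is applied to the identity path, in parallel with the residual branch.
        return torch.relu(self.residual(x) + self.se(x))
```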

Deletion of Output Layer Dropout:

Dropout [20] and Batch Normalization [21] are techniques for preventing overfitting in deep learning models. Dropout randomly disables a subset of neurons in each mini-batch, while Batch Normalization normalizes each node's output to prevent bias. Combining the two techniques can reduce accuracy [22], so, following this prior work, Dropout is not used in the output layer of the proposed model.
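The following is a hedged sketch of such an output layer: three separate convolutional heads predict the quality, angle, and width maps from the shared feature map, with no Dropout in front of them. The 1×1 kernels and the channel count are assumptions; in particular, the angle head is shown as a single map here, although angle prediction is often decomposed into sine and cosine components in practice.

```python
import torch
import torch.nn as nn

class GraspOutputHeads(nn.Module):
    """Three separate convolutional heads on the shared feature map; no Dropout."""
    def __init__(self, in_channels=32):
        super().__init__()
        self.quality_head = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.angle_head = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.width_head = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features):
        q = torch.sigmoid(self.quality_head(features))  # grasp quality map
        theta = self.angle_head(features)                # grasp angle map
        w = self.width_head(features)                    # gripper width map
        return q, theta, w
```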

Decoder Module:

The decoder module shown in Fig. 1 also includes the F block (Fig. 3), which combines the Concurrent Spatial and Channel Squeeze and Excitation (scSE) block [23] with ResNet. The scSE block was designed for segmentation and emphasizes pixel-level information by convolving each channel and computing a weight for each pixel. The module aims to reduce noise and information loss when the feature map is restored to the original image size.

Fig. 3. Improved decoder module.
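A minimal PyTorch sketch of the scSE block used inside the F block is shown below, following the original scSE formulation [23]: a channel excitation branch (cSE) weights each channel, a spatial excitation branch (sSE) weights each pixel, and the two recalibrated maps are combined. The reduction ratio and the additive combination are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SCSEBlock(nn.Module):
    """Concurrent spatial and channel squeeze-and-excitation (scSE)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel excitation (cSE): one weight per channel from global context.
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        # Spatial excitation (sSE): one weight per pixel from a 1x1 convolution.
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.cse(x) + x * self.sse(x)   # combine both recalibrations
```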

3 Evaluation of Network Models

In this chapter, we define the assumptions and evaluation methods for inference and perform inference. Furthermore, we compare the performance of the proposed method with that of the previous research, GR-ConvNet [15], and discuss the results.

3.1 Learning Environment

For training, we used an Nvidia RTX 2080 Ti and 32 GB of memory. Each network was trained for 100 epochs with a learning rate of \({10}^{-3}\) and a batch size of 8.

3.2 Dataset

The Jacquard dataset [24] is annotated using a simulation environment and CAD models, without human intervention. It was split into training, validation, and test data at an 8:1:1 ratio using two methods: IW (image-wise), which evaluates generalization to unknown object poses, and OW (object-wise), which evaluates generalization to unknown objects. In GR-ConvNet [15], the data were split 9:1 into training and validation sets.
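The sketch below illustrates the two split strategies, assuming each Jacquard sample carries an object identifier; the function names and the way samples are represented are hypothetical.

```python
import random

def split_image_wise(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """IW split: shuffle images, so unseen poses of known objects appear in the test set."""
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    n = len(samples)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]

def split_object_wise(samples, get_object_id, ratios=(0.8, 0.1, 0.1), seed=0):
    """OW split: shuffle object IDs, so whole objects are unseen at test time."""
    rng = random.Random(seed)
    ids = sorted({get_object_id(s) for s in samples})
    rng.shuffle(ids)
    n = len(ids)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    groups = (set(ids[:n_train]),
              set(ids[n_train:n_train + n_val]),
              set(ids[n_train + n_val:]))
    return tuple([s for s in samples if get_object_id(s) in g] for g in groups)
```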

3.3 Optimization Function

We used RAdam [25], proposed by Liu et al. in 2020, as the optimization function in this study. Adam [26] is a commonly used optimizer for deep learning, but Liu et al. found that it has high variance in the early stages of training. To address this, they proposed a method that behaves like SGD with momentum in the initial stages and afterwards applies a rectification term to Adam. This method outperforms Adam on tasks such as ImageNet image classification.
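A hedged sketch of the training setup with the hyperparameters from Sect. 3.1 (100 epochs, learning rate \({10}^{-3}\), batch size 8) is given below; the model, dataset, and loss function are placeholders, and torch.optim.RAdam, available in recent PyTorch versions, is used as the RAdam implementation.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, loss_fn, epochs=100, lr=1e-3, batch_size=8, device="cuda"):
    """Sketch of the training loop with the RAdam optimizer."""
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.RAdam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for images, targets in loader:
            images = images.to(device)
            targets = [t.to(device) for t in targets]   # quality, angle, width maps
            predictions = model(images)
            loss = loss_fn(predictions, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```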

3.4 Loss Function

In this study, we use the Smooth L1 Loss used in GR-ConvNet [15].

$$ loss(G_t, G_p) = \begin{cases} 0.5\left( G_t - G_p \right)^2, & \text{if } \left| G_t - G_p \right| < \beta \\ \left| G_t - G_p \right| - 0.5, & \text{otherwise} \end{cases} $$
(5)

Here, \(G_t\) represents the true value, \(G_p\) the predicted value, and β the threshold; the two cases are applied depending on the magnitude of the error. MAE Loss and MSE Loss are common loss functions in deep learning. MAE Loss is robust to outliers because it uses the absolute error, but it is not differentiable where the true and predicted values are equal. Smooth L1 Loss therefore uses MSE Loss when the absolute error is smaller than the threshold β and MAE Loss when it is larger, which reduces the effect of outliers. In this study, the threshold β is set to 1.0.
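For reference, Eq. (5) with β = 1.0 corresponds to the following PyTorch computation; summing the loss over the quality, angle, and width maps is an assumption of this sketch about how the total loss is formed.

```python
import torch
import torch.nn.functional as F

def grasp_loss(predictions, targets, beta=1.0):
    """Smooth L1 loss (Eq. 5) summed over the quality, angle, and width maps."""
    q_pred, theta_pred, w_pred = predictions
    q_true, theta_true, w_true = targets
    return (F.smooth_l1_loss(q_pred, q_true, beta=beta)
            + F.smooth_l1_loss(theta_pred, theta_true, beta=beta)
            + F.smooth_l1_loss(w_pred, w_true, beta=beta))
```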

3.5 Evaluation Index

We use the same evaluation index as GR-ConvNet [15]. A grasping position estimated by the network model is considered correct when the following two conditions are both satisfied:

  1. The IoU between the bounding box of the inferred grasping position and the ground truth is greater than 0.25.

  2. The error between the inferred grasp angle and the ground truth is less than 30°.

The evaluation index is calculated using Eq. (6); a sketch of this check is given after the equation.

$$ \text{Accuracy rate}\,[\%] = \frac{\text{number of correct answers}}{\text{number of data samples}} $$
(6)
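The sketch below illustrates this check using rotated grasp rectangles, the shapely library for the IoU computation, and an assumed gripper-jaw height to form the rectangles; these implementation details are our assumptions, not the exact evaluation code of this work.

```python
import numpy as np
from shapely.geometry import Polygon   # used here for rotated-rectangle IoU

def grasp_rectangle(x, y, theta, width, height=20.0):
    """Corner points of a rotated grasp rectangle (height is an assumed jaw size in pixels)."""
    dx = np.array([np.cos(theta), np.sin(theta)]) * width / 2.0
    dy = np.array([-np.sin(theta), np.cos(theta)]) * height / 2.0
    center = np.array([x, y])
    return Polygon([center + dx + dy, center + dx - dy, center - dx - dy, center - dx + dy])

def is_correct(pred, gt, iou_thresh=0.25, angle_thresh=np.deg2rad(30)):
    """Correct if IoU > 0.25 and the angle error is below 30 degrees."""
    # Grasp angles are symmetric modulo pi, so wrap the difference into [0, pi/2].
    angle_err = abs((pred["theta"] - gt["theta"] + np.pi / 2) % np.pi - np.pi / 2)
    p, g = grasp_rectangle(**pred), grasp_rectangle(**gt)
    iou = p.intersection(g).area / p.union(g).area
    return iou > iou_thresh and angle_err < angle_thresh

def accuracy_rate(pairs):
    """Accuracy rate [%] over a list of (prediction, ground-truth) dictionaries."""
    return 100.0 * sum(is_correct(p, g) for p, g in pairs) / len(pairs)
```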

3.6 Object to be Grasped

This section describes the objects grasped in the experiments using the physical robot.

Household Test Objects:

Twelve objects of different shapes and sizes were prepared. They include objects similar to those in the dataset as well as small, transparent, reflective, soft, and black objects, which are considered difficult to recognize visually. The household test objects are shown on the left in Fig. 4.

Adversarial Test Objects:

These are objects from the Dex-Net 2.0 dataset [27], used by Mahler et al. to validate the performance of the Grasp Quality CNN, and are considered difficult to grasp. In this study, eight such objects were prepared. The adversarial test objects are shown on the right in Fig. 4.

Fig. 4. Left: household test objects; right: adversarial test objects

4 Result

In this chapter, we present the results of inference performed with each network model, as well as experimental results using a physical robot arm, and demonstrate the process of inference during the experiments.

4.1 Inference with Network Models

We prepared three variants of the model. Modification (1) improves only the bottleneck. Modification (2) improves the bottleneck and the output layer. The proposed method improves the bottleneck, the output layer, and the decoder. We compared the original model with these three variants to show how effective each improvement is. The results are shown in Table 1.

Table 1. Results for each network model

4.2 Experiments with Robotic Arms

Experiments were conducted on GR-ConvNet [15] and the proposed method, which was the most accurate method in Table 1. The experimental results are shown in Table 2: the grasp success rate was 86.5% for GR-ConvNet [15] and 93% for the proposed method.

Table 2. Experimental results

4.3 Visualization of Inference

To show the results of grasping position inference, the quality map is visualized as an image in which red indicates a high probability of a successful grasp. Table 3 shows the inference results.

Table 3. Inference by actual equipment
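As a minimal sketch of how such a visualization can be produced (assuming matplotlib; the jet colormap, in which red marks high values, is our choice and not necessarily the one used here):

```python
import numpy as np
import matplotlib.pyplot as plt

def show_quality_map(image, q_map):
    """Overlay the quality map on the input image; red regions mark high grasp quality."""
    plt.imshow(image)
    plt.imshow(q_map, cmap="jet", alpha=0.5, vmin=0.0, vmax=1.0)  # red = high probability
    plt.colorbar(label="grasp quality")
    plt.axis("off")
    plt.show()

# Hypothetical inputs standing in for a camera image and a predicted quality map.
image = np.zeros((224, 224, 3))
q_map = np.random.rand(224, 224)
show_quality_map(image, q_map)
```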

5 Discussion and Conclusion

The effectiveness of the proposed method was demonstrated through network evaluation and grasp experiments using a robot [28, 29]. As shown in Table 1, each improvement increased the accuracy of the network, and this gain is believed to have been achieved without significantly increasing the number of parameters. Furthermore, as shown in Table 2, improving the accuracy of the network also improved the accuracy of the physical grasp experiments using a robot arm.

According to Table 3, the success rate of grasping black objects, which are difficult to recognize, improved. In the failure case of GR-ConvNet, the grasping angle was not appropriate for the remote control. In addition, for the wristwatch, there were cases where an appropriate grasping position was not detected on the reflective part. In contrast, the proposed method clearly indicates where a grasp with a high success rate is expected. However, both methods showed low accuracy for small and transparent objects. One reason for these grasp failures is that such objects are also difficult for humans to recognize visually: because of their low visibility, the amount of image information available is limited, and accurate inference must be made from this limited information.

In the future, efforts will focus on improving the grasping of objects that are difficult for humans to recognize visually [30, 31], by improving the network model and changing the evaluation criteria. Preprocessing techniques such as removing reflections and enlarging small objects, as well as other approaches, will also be considered in order to obtain more information.