1 Introduction

In the RoboCup Soccer Standard Platform League (SPL), ball detection is a fundamental and crucial capability: it provides the target's distance and position to the robots under varying lighting conditions. In addition, as the only standard platform used in the SPL, the Softbank Robotics NAO has very limited resources, such as constrained computational power and limited camera resolution. Designing a real-time and efficient ball detection system for games is therefore challenging: the ball is tiny, images are blurred, illumination is uneven, the ball is often occluded, and many similar-looking objects appear on the field. Traditional machine learning and image processing methods for ball recognition usually produce many false positives and missed detections.

State-of-the-art CNNs show excellent classification and object detection ability, but existing CNN-based detectors carry a massive computational cost and rely on server-class GPUs. For mobile devices, several lightweight CNN-based detectors have been proposed, such as YOLO-LITE, tiny-YOLO, Xception, MobileNet, and XNOR-Net [1, 11,12,13,14,15,16], to reduce the computational cost on GPUs. However, when these networks are transferred to the NAO's vision application, their real-time performance does not meet the requirements of competition. To remain competitive in games, we have to balance real-time performance against detection accuracy. Furthermore, because the input resolution affects the model's runtime performance, we used the NAO's cameras to capture a large number of images under various lighting conditions in our lab and on the competition field, 1008 unique images in total. While guaranteeing detection performance, we pay particular attention to detection efficiency.

In this paper, we first present a dataset, and then investigate the effectiveness of depthwise convolution with binary weights in achieving real-time operation and the desired detection accuracy on the NAO robot. In the network design part, we describe the process of building the network structure step by step. The experimental results show that the computation time of our network decreases at every design step without significant performance degradation. Satisfactory results have also been achieved in practical use.

The remainder of this paper is organized as follows. Section 2 introduces related work and analyzes its shortcomings. Section 3 describes our dataset in detail. Section 4 describes the step-by-step development of the network structure. Section 5 presents experiments, and Sect. 6 concludes.

2 Related Works

Lightweight CNN.

As state-of-the-art one-stage object detection algorithms, YOLO [12,13,14] and SSD [18] run in real time on GPUs with high accuracy, and YOLOv3-tiny [14] further improves detection efficiency with acceptable accuracy on GPUs. But all of them carry a massive computational cost. Recently, there has been progress on object detection architectures aimed at mobile and embedded vision applications, such as MobileNet [15] and ShuffleNet [17]. However, these designs rely on depthwise separable convolution, which lacks an efficient implementation, and Pelee [3] runs on mobile devices only at low frame rates. Compared with SSD MobileNet V1, YOLO-LITE [1] improves computational speed, but at the cost of detection accuracy. Considering the balance between real-time performance and detection accuracy for NAO robot vision, we propose a real-time lightweight CNN based on depthwise convolution with binary weights for ball detection on NAO robots with better performance. Our design is mainly focused on efficiency.

Compression of CNNs.

Generally, CNN compression reduces the number of parameters and the storage footprint of a model by methods such as pruning, quantization and approximation. Different methods have been proposed for pruning a network in [4,5,6,7]. Quantization techniques for weights and for the representations of layers in CNNs were shown in [8, 9]. With respect to approximation, the authors of [10] proposed using the FFT to compute the required convolutions. The authors of [11] proposed a novel CNN that introduced two efficient approximations based on weight binarization: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the weight values are approximated with the closest binary values, resulting in a model roughly 32x smaller. Furthermore, XNOR-Networks, in which both the weights and the inputs of the convolutional layers are binary, offer a 58x speed-up on a CPU by relying mostly on XNOR and bit-counting operations, at the cost of a 12.4% drop in top-1 accuracy. Inspired by this idea, our work uses binary weights to compress our model.

Detection Algorithms on NAO.

In response to the ball detection task, different recognition algorithms have been proposed by teams from all over the world participating in the competition. UChile, the Chilean team, proposed a classification algorithm based on pentagonal recognition, which separates positive and negative ball samples well, but it misses many detections when the image is blurred. The German team HULK proposed a classification algorithm using Haar features. Although it improves recognition accuracy to some extent, it is computationally expensive, which slows the robot's reaction. Nao-Team HTWK and UT Austin Villa use a shallow CNN classifier, but they first have to apply traditional image processing methods to generate hypotheses, and then use the CNN classifier to decide whether each hypothesis is a ball. Generating hypotheses with traditional methods in this way is very likely to miss detections, and good ball features are lost in the resizing step. In addition, such a shallow network with only 1–2 convolutional layers, generally consisting of convolution, batch normalization, ReLU activation and max pooling layers, has weak feature extraction ability and poor generalization as a classifier.

3 Data Set

The proposed dataset was collected in our lab and on real RoboCup competition fields, and consists of 1008 unique images containing a ball. The original images captured from the NAO's cameras are in YUV format, downscaled to 640 × 480 pixels for the upper camera and 320 × 240 pixels for the lower camera. In order to speed up processing and improve robustness across scenarios, only the luminance (Y) channel of each image was extracted, with the NAO in action under various lighting conditions. After the original dataset was obtained, the ball pixels were manually labelled. To accelerate processing while preserving detection accuracy, we resized the labelled input images to an intermediate size of 416 × 416 pixels for later training and testing. An example from the proposed dataset is shown in Fig. 1, and the dataset is available online at https://github.com/qyan0131/Binary-8-DataSet.git.

Fig. 1. Dataset for training and testing: it consists of 1008 Y-channel images containing a ball, captured from the NAO's cameras. The images are 640 × 480 pixels from the upper camera and 320 × 240 pixels from the lower camera.
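The following is a minimal preprocessing sketch, written in C++ with OpenCV purely for illustration (the paper does not specify its tooling, and the function name loadNetworkInput is ours): it loads a stored Y-channel image as a single-channel grayscale image and resizes it to the 416 × 416 network input used for training and testing.

```cpp
// Hedged preprocessing sketch: assumes Y-channel frames are stored as
// single-channel image files and OpenCV is available.
#include <opencv2/opencv.hpp>
#include <string>

cv::Mat loadNetworkInput(const std::string& path)
{
    cv::Mat y = cv::imread(path, cv::IMREAD_GRAYSCALE);  // luminance (Y) channel only
    cv::Mat input;
    cv::resize(y, input, cv::Size(416, 416));             // training/testing resolution
    return input;
}
```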

4 Network Design

In this section, we describe the network design procedure in detail. The proposed network focuses on efficiency while maintaining acceptable performance when deployed on the NAO, and the design consists of four parts. In the backbone part, we first design a small standard CNN as the fundamental network, which is able to extract sufficient features for NAO vision detection; under the premise of maintaining detection accuracy, we then compress this standard CNN into the backbone by reducing the number of layers and filters as much as possible. Next, inspired by MobileNet, we replace the backbone's convolutions with depthwise separable convolutions to greatly reduce the number of parameters. Furthermore, a binary-weight approach is applied to the point-wise convolutions to further speed up computation, because point-wise convolution accounts for more than 80% of the floating-point computation in the MobileNet architecture, whereas binary weights rely mostly on XNOR and bit-counting operations. Finally, targeting the Intel CPU used by the NAO robot, we rewrite the convolution, batch normalization and ReLU non-linear activation operations of the network with SIMD instructions, which again increases speed several times over.

4.1 Backbone

In this paper, we do not apply a CNN only as a classifier; we use it to achieve end-to-end object detection. Consequently, we first build a backbone with sufficient feature extraction and generalization capability. Then, to deal with the small size of the ball, we use an anchor mechanism and design three anchors for objects of different sizes. Finally, the output data structure of the network is given.
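Although the exact output layout is defined by the network tables, the following hedged C++ sketch illustrates how a YOLO-style anchor prediction could be decoded into a ball bounding box; the grid-based decoding, the single-class assumption and all names here are our illustrative assumptions, not the paper's code.

```cpp
// Hedged illustration of YOLO-style anchor decoding for one grid cell.
#include <cmath>

struct Box { float x, y, w, h, conf; };

static float sigmoidf(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// tx,ty,tw,th,tobj: raw outputs for one anchor at grid cell (cx,cy);
// (aw,ah): anchor size in pixels; gridW,gridH: output grid; imgW,imgH: input size.
Box decodeAnchor(float tx, float ty, float tw, float th, float tobj,
                 int cx, int cy, float aw, float ah,
                 int gridW, int gridH, int imgW, int imgH)
{
    Box b;
    b.x = (cx + sigmoidf(tx)) / gridW * imgW;  // box centre in image coordinates
    b.y = (cy + sigmoidf(ty)) / gridH * imgH;
    b.w = aw * std::exp(tw);                   // anchor scaled by predicted factor
    b.h = ah * std::exp(th);
    b.conf = sigmoidf(tobj);                   // objectness for the single "ball" class
    return b;
}
```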

In the backbone design procedure, we adopt a sequential, iterative method, weighing running time and accuracy against the number of layers and channels, because both the number of layers and the number of filters per layer affect the parameter count and computation cost. More layers give the network stronger non-linearity, stronger feature extraction, and better robustness and generalization; more filters per layer let more information flow between adjacent layers, yielding richer and more accurate features. However, increasing either the number of layers or the number of filters degrades real-time performance.

The Darknet Reference Model is a small but efficient network proposed in [20]. Inspired by it, we prune the Darknet Reference network layer by layer while repeatedly training and testing; when the accuracy drops dramatically, we stop removing layers. We then start reducing the number of filters per layer and, similarly, stop when the accuracy on the training and test sets declines significantly. In this way we obtain a backbone containing 8 convolutional layers, shown in Table 1, which we call Backbone-8.

Table 1. Backbone-8 architecture

4.2 Using Depthwise Convolution

MobileNet [15] uses depthwise separable convolutions, in contrast to YOLO's approach, to lighten a model for real-time object detection. A depthwise separable convolution combines a depthwise convolution with a point-wise convolution: the depthwise convolution applies one filter to each input channel, and the point-wise convolution then applies a 1 × 1 convolution [15] to combine and expand the channels.
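As a concrete illustration of the two stages, the following is a minimal C++ sketch of a depthwise separable convolution written as naive loops; the CHW layout, stride 1, zero padding and function names are our assumptions for clarity, not the optimized implementation used on the robot.

```cpp
// Minimal sketch of a depthwise separable convolution: depthwise 3x3 followed
// by pointwise 1x1, on CHW float tensors with stride 1 and zero padding.
#include <vector>

// Depthwise 3x3: one filter per input channel (dw holds M*3*3 weights).
void depthwise3x3(const std::vector<float>& in, int M, int H, int W,
                  const std::vector<float>& dw, std::vector<float>& out)
{
    out.assign(M * H * W, 0.0f);
    for (int c = 0; c < M; ++c)
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                float acc = 0.0f;
                for (int ky = -1; ky <= 1; ++ky)
                    for (int kx = -1; kx <= 1; ++kx) {
                        int iy = y + ky, ix = x + kx;
                        if (iy < 0 || iy >= H || ix < 0 || ix >= W) continue;  // zero padding
                        acc += in[(c * H + iy) * W + ix] *
                               dw[(c * 3 + (ky + 1)) * 3 + (kx + 1)];
                    }
                out[(c * H + y) * W + x] = acc;
            }
}

// Pointwise 1x1: combines the M depthwise outputs into N output channels
// (pw holds N*M weights); this is where most of the remaining multiply-adds sit.
void pointwise1x1(const std::vector<float>& in, int M, int H, int W, int N,
                  const std::vector<float>& pw, std::vector<float>& out)
{
    out.assign(N * H * W, 0.0f);
    for (int n = 0; n < N; ++n)
        for (int c = 0; c < M; ++c)
            for (int i = 0; i < H * W; ++i)
                out[n * H * W + i] += pw[n * M + c] * in[c * H * W + i];
}
```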

Following [15], compared with standard convolution, using depthwise separable convolutions reduces the computational cost by a factor of:

$$ \frac{D_{K} \cdot D_{K} \cdot M \cdot D_{F} \cdot D_{F} + M \cdot N \cdot D_{F} \cdot D_{F}}{D_{K} \cdot D_{K} \cdot M \cdot N \cdot D_{F} \cdot D_{F}} = \frac{1}{N} + \frac{1}{D_{K}^{2}} $$
(1)

where \( D_{K} \) is the spatial dimension of the kernel, \( D_{F} \) the spatial dimension of the feature map, \( M \) the number of input channels and \( N \) the number of output channels, so that \( D_{K} \cdot D_{K} \cdot M \cdot N \) is the size of the parameterized convolution kernel \( K \) and \( D_{F} \cdot D_{F} \cdot M \) the size of the input feature map.
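As a quick sanity check (our own arithmetic; \( N = 64 \) is just a representative output-channel count), the 3 × 3 kernels used here give

$$ \frac{1}{N} + \frac{1}{D_{K}^{2}} = \frac{1}{64} + \frac{1}{9} \approx 0.13, $$

so a depthwise separable layer needs roughly 8 times fewer multiply-accumulate operations than a standard convolution, consistent with the 8–9x figure reported in the next paragraph.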

To make the network of Sect. 4.1 more suitable for real-time use, we rewrite the backbone in the MobileNet style and call the result Depthwise-8. In this model we use 3 × 3 depthwise separable convolutions, which require 8 to 9 times less computation than Backbone-8 (Table 2).

Table 2. Depthwise-8 architecture

4.3 Using Weight Binarization

Floating-point operations are time-consuming on a device CPU, which is one of the main factors limiting CNN inference on CPUs. Binarizing the weights converts expensive floating-point operations into simple XNOR operations and thus accelerates computation. In our experiments with the MobileNet architecture, we found that point-wise convolution has limited feature extraction ability yet accounts for more than 80% of the total computation cost, while depthwise convolution extracts features effectively. Based on these findings, we apply binary weights to the point-wise convolutions to further accelerate the whole computation. Following [11], the convolutional weights can be approximated by:

$$ \mathcal{A}_{lk} = \frac{1}{n} \left\| \mathcal{W}_{lk}^{t} \right\|_{\ell 1} $$
(2)
$$ \mathcal{B}_{lk} = \operatorname{sign}\left( \mathcal{W}_{lk}^{t} \right) $$
(3)
$$ \widetilde{\mathcal{W}}_{lk} = \mathcal{A}_{lk} \mathcal{B}_{lk} $$
(4)

where \( \mathcal{W} \in \mathbb{R}^{n} \) is a filter with \( n \) weights, and the indices \( l \) and \( k \) denote the \( k \)th filter in the \( l \)th layer.
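A minimal C++ sketch of the binarization in Eqs. (2)–(4) follows, assuming each filter is stored as a flat array of 32-bit floats; bit-packing and the XNOR/bit-counting convolution kernels are omitted, and the struct and function names are ours.

```cpp
// Approximate a filter W by a scale alpha = (1/n)*||W||_1 and a sign tensor B,
// so that W ~ alpha * B (Eqs. (2)-(4)).
#include <cmath>
#include <cstdint>
#include <vector>

struct BinaryFilter {
    float alpha;               // per-filter scaling factor A_lk
    std::vector<int8_t> sign;  // B_lk in {-1, +1} (bit-packed in a real kernel)
};

BinaryFilter binarizeFilter(const std::vector<float>& w)
{
    BinaryFilter bf;
    bf.sign.resize(w.size());
    float l1 = 0.0f;
    for (float v : w) l1 += std::fabs(v);
    bf.alpha = l1 / static_cast<float>(w.size());   // Eq. (2)
    for (std::size_t i = 0; i < w.size(); ++i)
        bf.sign[i] = (w[i] >= 0.0f) ? 1 : -1;       // Eq. (3)
    return bf;                                      // W ~ alpha * sign, Eq. (4)
}
```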

Since the network proposed in this paper has few layers and its input is 8-bit unsigned integer data, binarizing the network input, i.e. the image or the output of each layer, would actually take more time to traverse the whole image. Therefore, we binarize only the network weights. The structure of the binary-weight network is shown in Table 3; we call it Binary-8.

Table 3. Binary-8 architecture

4.4 Boost Real Time Performance

The network of Sect. 4.3 already has strong real-time performance, but we can still use the SIMD instructions provided by the Intel CPU to accelerate execution on NAO robots and enhance real-time performance further. SIMD stands for Single Instruction, Multiple Data: multiple operands are packed into a single wide register and processed by one instruction. SSE is one of the SIMD instruction sets supported by the NAO's CPU. The NAO uses a 32-bit Intel CPU with 128-bit SSE registers, and the CNN image input consists of 8-bit unsigned integers, so 16 pixel values can be processed by a single instruction, making the CNN computation several times faster.

We therefore rewrite the convolution, batch normalization and ReLU non-linear activation operations of the network with SIMD instructions.
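To make the SIMD idea concrete, here is an illustrative sketch (not the paper's actual kernels) of the inner dot product between 8-bit unsigned pixels and 8-bit signed weights using SSE2/SSSE3 intrinsics available on the NAO's Atom CPU; the function name, data layout and the assumption that the length is a multiple of 16 are ours.

```cpp
// 16 u8*s8 products per 128-bit register, accumulated into 32-bit sums.
// Note: _mm_maddubs_epi16 saturates its 16-bit intermediates, so a real kernel
// must keep weights/pixels in range or accumulate more often.
#include <emmintrin.h>   // SSE2
#include <tmmintrin.h>   // SSSE3
#include <cstdint>

int dot_u8_s8_sse(const uint8_t* pix, const int8_t* wts, int len)
{
    __m128i acc  = _mm_setzero_si128();
    __m128i ones = _mm_set1_epi16(1);
    for (int i = 0; i < len; i += 16) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pix + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(wts + i));
        __m128i prod16 = _mm_maddubs_epi16(a, b);   // 16 products -> 8 x 16-bit sums
        __m128i prod32 = _mm_madd_epi16(prod16, ones); // 8 x 16-bit -> 4 x 32-bit sums
        acc = _mm_add_epi32(acc, prod32);
    }
    int32_t lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];  // horizontal sum
}
```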

5 Experimental Results

5.1 Comparison Among Proposed Networks

We evaluate the performance of the proposed approach on NAO camera images. After training the models on our dataset, the network architectures, together with their respective weights, were tested on the customized test set. The number of parameters, inference time and accuracy of the three proposed networks are shown in Table 4, and Fig. 2 shows the APs during the training phase.

Table 4. Comparison of the three proposed networks (Intel Atom 1.9 GHz CPU @ 320 × 240 pixels)
Fig. 2. Average precision of the three proposed networks

As shown in the table, every step of the network design improves computation speed while keeping the accuracy almost unchanged. The final network with SSE optimization takes only 7.1 ms to process an image at a resolution of 320 × 240, which is fast enough to run on the NAO robot.

Figure 3 shows the IoU and loss of the proposed networks during training. The IoU rises fastest for Backbone-8, then Depthwise-8, then Binary-8; however, as the number of training epochs increases, the IoU of all three networks stabilizes at very similar values. The loss behaves similarly: Backbone-8 declines fastest and Binary-8 slowest, but all eventually stabilize. The difference is that Backbone-8 converges to the smallest loss and Binary-8 to the highest. Considering the trade-off between computation time and performance, however, Binary-8 is the most efficient network.

Fig. 3. IoU and loss among the proposed networks

5.2 Comparison Among Typical CNN Models

We also compare our proposed network with several well-known and state-of-the-art lightweight networks, as shown in Table 5. Weight size, accuracy, BFLOPS and inference time on the NAO robot are considered.

Table 5. Comparison of typical CNN models on Ball Dataset

According to the results in Table 5, the network we designed shows superior performance compared with other typical lightweight models: it greatly improves computation speed at similar accuracy.

Experiments show that the proposed network has strong real-time performance (about 140 FPS on the NAO robot CPU), and its accuracy (above 97%) meets the recognition requirements.

6 Conclusion and Future Work

We propose a simple, efficient, and accurate CNN for ball detection on NAO robots. We train a neural network that combines depthwise convolution with learned binary weights. To speed up execution on the NAO CPU, we present a method for rewriting the convolution layer, batch normalization layer and ReLU activation function using SSE. We also present an annotated RoboCup Standard Platform League image dataset, allowing other RoboCup researchers to train new models. The proposed network detects balls accurately and runs on the NAO CPU in real time.

In the future, we will continue to investigate more real-time CNNs on NAO robots with new techniques or new network architectures. We may study grouped point-wise convolution as proposed in ShuffleNet [17], since point-wise convolution accounts for much of the computation in our network. We may also study concatenating different layers to combine more feature information and further reduce the number of parameters. For the RoboCup competition, we may use this network to detect all objects in games (i.e. ball, robot, obstacle, goalpost, etc.). We may also use the backbone and similar techniques to build a real-time semantic segmentation algorithm on NAO robots to segment different objects and regions on the field.