
1 Introduction

With the wide application of deep learning in the field of computer vision [1, 2], an increasing number of researchers are engaged in intelligent agriculture. At present, many network model methods have been applied to practical problems in agricultural production (such as pest prediction and monitoring, automatic harvesting, etc.). However, as operating efficiency improves and agricultural equipment is upgraded, the requirements for the real-time performance and accuracy of operating machines have gradually increased [3,4,5], and the demands on agricultural vision systems have risen accordingly [6, 7]. In the complex orchard environment, the results of target fruit detection [8, 9] and segmentation [10,11,12] limit the performance of the visual system. Factors such as apple density (Fig. 1(a)), illumination angle change (Fig. 1(b)), branch and leaf occlusion (Fig. 1(c)), and overlapping apples (Fig. 1(d)) all affect target detection, which brings great difficulties and challenges to the accurate recognition of fruits.

Fig. 1.

Challenging factors in the orchard environment. (a) Apple density; (b) illumination angle change; (c) branch and leaf occlusion; (d) overlapping apples

To address these challenges, an effective and accurate apple instance segmentation method based on a dual attention-guided network is proposed to improve the segmentation accuracy of apples in complex environments. More precisely, the main contributions of this paper are summarized as follows:

  1. Aiming at the problems of illumination, occlusion and overlap in complex environments, an anchor-free apple instance segmentation method based on dual attention guidance is proposed.

  2. An adaptive Image Correction Module (ICM) is introduced to enhance the robustness of the image to natural illumination and contrast changes.

  3. A Multi-Scale Enhancement Fusion (MSEF) module is introduced into the feature pyramid network (FPN) to enhance the feature extraction ability of the image and reduce the interference of complex backgrounds on apple detection results without increasing the amount of computation.

  4. To improve the segmentation accuracy of overlapping and occluded apples in complex environments, a new Dual Attention-Guided Mask (DAGM) branch is added to deal with fruit occlusion and overlap.

2 Related Works

Traditional machine learning methods have made important contributions to fruit detection and segmentation [13,14,15]. A yield prediction strategy based on texture, fruit color and edge shape was proposed, with a recognition rate of green apples under natural light close to 95% [16]. Tian used RGB spatial information to locate the center and radius of the apple and combined it with depth image information to match the target area [17]. These methods are not sufficient to identify overlapping or clustered fruits. To solve this problem, a robust apple image segmentation algorithm based on a fuzzy reasoning system was proposed, which improves the generalization ability of segmentation [18]. However, owing to the lack of in-depth analysis of image features, these methods usually show poor robustness and adaptability in complex environments with occlusion and overlap.

With the rapid development of deep learning theory, an increasing number of deep learning methods have been proposed for agricultural fruit detection [19,20,21,22,23,24]. Liu used the improved single-stage detector Yolo-V3 to perform tomato localization and detection in complex scenes [25]. Jia combined DenseNet and ResNet as the feature extraction backbone of the original model to improve Mask R-CNN, greatly improving the identification accuracy of apples under overlap and occlusion [4]. Compared with traditional visual methods, recognition models based on deep learning offer greatly improved accuracy and applicability. However, these methods usually require substantial computing and storage resources, which can seriously affect the segmentation speed and operational stability of agricultural equipment in practical applications.

3 Methods

Fig. 2.

Illustration of our framework. (a) Image Correction Module (ICM); (b) Anchor-free Detection Module; (c) Dual Attention-guided Mask (DAGM) Module

To improve the accuracy and efficiency of apple instance segmentation in complex environments, we propose an anchor-free apple instance segmentation method based on a dual attention-guided network. The framework of our method is shown in Fig. 2, which includes three parts: (a) Image Correction Module (ICM); (b) Anchor-free Detection Module; (c) Dual Attention-guided Mask (DAGM) Module.

3.1 Image Correction Module (ICM)

To cope with the challenge of illumination change, the ICM transforms images captured under different illuminations into images with similar illumination. The module follows IBN-Net [26] and realizes adaptive image correction by constructing an encoder-decoder network (as shown in Fig. 3). First, a convolutional neural network extracts image features; deconvolution is then used as a decoder to resample and restore a corrected image of the same size as the input. During deconvolution, the network parameters are trained to ensure that the corrected images have similar illumination intensities.

Fig. 3.

Schematic diagram of the image correction module (ICM).

ICM is built on a trainable network, allowing end-to-end training and adaptive correction of images, rather than increasing or decreasing brightness at a fixed rate during data augmentation. The basic network structure of ICM is shown in Fig. 3(a). The low-level features of a convolutional neural network reflect the appearance of objects, such as texture and color, while the high-level features reflect the semantic information of the target. Therefore, adding image normalization in the lower layers of the network adjusts the distribution of image data under different illumination to a similar region, increasing the adaptability of the network to illumination and thus reducing the influence of illumination changes.
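As a concrete illustration, the following PyTorch sketch shows how such an encoder-decoder correction module could be structured, with instance normalization applied in the lower layers in the spirit of IBN-Net. The channel widths, layer counts and output activation are illustrative assumptions, not the exact configuration used in this paper.

```python
import torch.nn as nn

class ICM(nn.Module):
    """Illustrative encoder-decoder image correction module.

    Instance normalization in the early, appearance-level layers follows
    the IBN-Net idea of normalizing away illumination/contrast variation.
    Channel widths, depths and the output activation are assumptions for
    illustration, not the paper's exact configuration.
    """
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),
            nn.InstanceNorm2d(32),  # normalize low-level appearance statistics
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.InstanceNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            # deconvolution resamples the features back to the input size
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
            nn.Sigmoid(),  # corrected image in [0, 1]
        )

    def forward(self, x):
        # x: (N, 3, H, W) -> corrected image of the same size
        return self.decoder(self.encoder(x))
```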

Fig. 4.

Architecture of bounding-box pre-detection. (a) MSEF-FPN Backbone; (b) FCOS Box Prediction Head; (c) Multi-Scale Enhancement Fusion (MSEF) module.

3.2 MSEF-FPN Feature Extraction and Detection Network

Our model uses FCOS (shown in Fig. 4) as the basic detection model and improves it. Introducing MSEF into the lateral connections of the FPN addresses some of its defects: direct fusion of features may weaken the multi-scale representation because of inconsistent semantic information across levels. In addition, during picking, owing to shooting distance, occlusion or overlap, the target often occupies only a small proportion of the image. After deep convolution operations, the target's feature map shrinks considerably, which greatly reduces the spatial information it contains and thus lowers detection accuracy. Therefore, to aggregate multi-scale features and maintain a high-resolution representation during convolution, MSEF is introduced at the lateral connections of the FPN to improve the feature extraction capability of the image. Figure 4(c) shows the structure of the improved MSEF.

First, we use two weight transformations \( W_{\alpha }X_{q}\) and \( W_{\beta }X_{p}\) to reduce the number of channels and hence the amount of computation. The two output matrices are multiplied (with \( W_{\alpha }X_{q}\) transposed) to compute the similarity, and a softmax operation then yields the position attention, i.e., the normalized correlation between each pixel in the current feature map and the pixels at all other positions. Finally, multiplying by the \( W_{j}X_{p}\) matrix applies the position attention mechanism to the corresponding position of every feature map across all channels. A 1\(\,\times \,\)1 convolution restores the number of output channels to ensure that the input and output scales are exactly the same. The corresponding non-local operations are shown in Eqs. (1), (2) and (3).

$$\begin{aligned} f(x_{q},x_{p})=e^{(W_{\alpha }X_{q})^T( W_{\beta }X_{p})} \end{aligned}$$
(1)
$$\begin{aligned} C(x)=\sum _{\forall p}f(x_{q},x_{p}) \end{aligned}$$
(2)
$$\begin{aligned} y_{q}=\frac{1}{C(x)}\sum _{\forall p}f(x_{q},x_{p})( W_{j}X_{p})=softmax((W_{\alpha }X_{q})^T(W_{\beta }X_{p}))(W_{j}X_{p}) \end{aligned}$$
(3)
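The following PyTorch sketch illustrates one way to realize the non-local position attention of Eqs. (1)-(3). Implementing \(W_{\alpha }\), \(W_{\beta }\) and \(W_{j}\) as 1\(\,\times \,\)1 convolutions, the channel-reduction ratio, and the residual connection are standard non-local-block choices and should be read as assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAttention(nn.Module):
    """Sketch of the non-local position attention of Eqs. (1)-(3).

    W_alpha, W_beta and W_j are realized as 1x1 convolutions; the channel
    reduction ratio (2) and the residual connection are common non-local
    choices and assumptions here.
    """
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.w_alpha = nn.Conv2d(channels, inter, 1)  # W_alpha X_q
        self.w_beta = nn.Conv2d(channels, inter, 1)   # W_beta  X_p
        self.w_j = nn.Conv2d(channels, inter, 1)      # W_j     X_p
        self.restore = nn.Conv2d(inter, channels, 1)  # restore channel count

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.w_alpha(x).flatten(2)  # (N, C', HW)
        p = self.w_beta(x).flatten(2)   # (N, C', HW)
        v = self.w_j(x).flatten(2)      # (N, C', HW)
        # Eqs. (1)-(2): exponentiated similarity normalized over all
        # positions p, i.e. a softmax over the last dimension.
        attn = F.softmax(torch.bmm(q.transpose(1, 2), p), dim=-1)  # (N, HW, HW)
        # Eq. (3): weight the values W_j X_p by the attention map.
        y = torch.bmm(v, attn.transpose(1, 2)).view(n, -1, h, w)
        return x + self.restore(y)  # 1x1 conv restores the input scale
```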

3.3 Dual Attention-Guided Network for Instance Segmentation

Fig. 5.

The architecture of DAGM.

Considering that fruits are located in complex environments, many interference factors greatly reduce the segmentation performance of the model. We therefore designed the Dual Attention-guided Mask (DAGM) module, shown in Fig. 5. Compared with general segmentation frameworks, this method has obvious advantages in segmenting small objects with serious occlusion or overlap. At the front end of the framework, a convolutional network performs feature extraction; at the back end, conditional/Markov random fields refine the front-end output to obtain the segmentation results.

The DAGM branch applies the bounding boxes predicted by FCOS to further predict the segmentation mask of each region of interest (ROI). First, the predicted ROIs are distributed to different FPN feature layers according to their resolution, and ROI Align is used for feature alignment. This is similar to how Mask R-CNN predicts the segmentation mask; however, the relationship between the original image resolution and the ROI size must be considered in order to reasonably allocate each ROI to the feature layer of the corresponding resolution (in keeping with the FPN multi-scale strategy). Second, after ROI Align extracts the features in each ROI at \(14\times 14\) resolution, these features are passed to the Convolutional Block Attention (CBA) network. Specifically, the features are processed by max pooling and average pooling to obtain two \(1\times 1\times C\) feature vectors, which are passed through a shared MLP; the two output vectors are then added to obtain the Channel Attention (CA) weights of the different channels, as computed in Eq. (4). Multiplying CA by the input feature matrix yields the feature matrix combined with channel attention, shown as Feature X' in Fig. 5. The channel-attended features are then max pooled and average pooled along the channel dimension to produce two \(W\times H\times 1\) maps, which are concatenated in the depth direction and convolved to obtain the Spatial Attention (SA) integrating spatial weight information, as computed in Eq. (5), where \(f^{7\times 7}\) denotes a convolution with a \(7\times 7\) kernel. Finally, SA is multiplied by Feature X' to obtain the Refined Feature X'', which combines channel and spatial attention information.

$$\begin{aligned} CA(X)=\sigma (MLP(maxPool(X))+MLP(avgPool(X))) \end{aligned}$$
(4)
$$\begin{aligned} SA(X)=\sigma (f^{7\times 7}([maxPool(X'); avgPool(X')])) \end{aligned}$$
(5)

Then, the enhanced spatial attention feature map is upsampled to generate a feature map with a resolution of \(28\times 28\), and a \(1\times 1\) convolution kernel is used to generate the instance segmentation mask.
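A compact PyTorch sketch of the channel attention of Eq. (4) and the spatial attention of Eq. (5), as they would act on the ROI features in the DAGM branch, is given below. The shared-MLP reduction ratio is a common CBAM-style default and an assumption here.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Sketch of the channel attention (Eq. 4) and spatial attention
    (Eq. 5) applied to the ROI features in the DAGM branch. The shared-MLP
    reduction ratio (16) is a common CBAM-style default and an assumption.
    """
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(  # shared MLP for both pooled vectors
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # f^{7x7}

    def forward(self, x):
        # Eq. (4): CA = sigmoid(MLP(maxpool(X)) + MLP(avgpool(X)))
        ca = torch.sigmoid(self.mlp(x.amax(dim=(2, 3), keepdim=True))
                           + self.mlp(x.mean(dim=(2, 3), keepdim=True)))
        x1 = x * ca  # Feature X'
        # Eq. (5): pool along the channel axis, concatenate the two
        # W x H x 1 maps, convolve with a 7x7 kernel, apply sigmoid.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x1.amax(dim=1, keepdim=True), x1.mean(dim=1, keepdim=True)],
            dim=1)))
        return x1 * sa  # Refined Feature X''
```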

3.4 The Loss Functions

The overall loss function \(L_{total}\) of the model (Eq. (6)) is composed of \(L_{cls}\), \(L_{reg}\), \(L_{center}\) and \(L_{mask}\), where \(L_{cls}\) is the classification loss, \(L_{center}\) is the center-ness loss, \(L_{reg}\) is the box regression loss, and \(L_{mask}\) is the mask loss using the average binary cross-entropy loss.

$$\begin{aligned} {\begin{matrix} &{}L_{total}=\frac{1}{N_{pos}}\sum _{x,y}L_{cls}(p_{x,y},p_{x,y}^*) +\frac{\lambda }{N_{pos}}\sum _{x,y}p_{x,y}^*L_{reg}(d_{x,y},d_{x,y}^*)\\ &{}+\frac{\beta }{N_{pos}}\sum _{x,y}p_{x,y}^*L_{center}(center_{x,y},center_{x,y}^*) +L_{mask}(s_{x},s_{x}^*) \end{matrix}} \end{aligned}$$
(6)

In Eq. (6), \(p_{x,y}\), \(d_{x,y}\) and \(center_{x,y}\) are the predicted values of the classification, regression and center-ness branches at spatial position (x, y); \(p_{x,y}^*\), \(d_{x,y}^*\) and \(center_{x,y}^*\) are the corresponding training targets at that position. Among the loss terms, \(L_{reg}\) and \(L_{center}\) are computed only for positive samples, \(N_{pos}\) is the number of positive samples, and \(\lambda \) and \(\beta \) are the balance coefficients of the respective loss terms.

The classification loss \(L_{cls}\) in Eq. (6) is shown in Eqs. (7) and (8):

$$\begin{aligned} L_{cls}(p_{x,y},p_{x,y}^*)=-\alpha _{t}(1-p_{x,y}^t)^\gamma log(p_{x,y}^t) \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned}&p_{x,y}^t= {\left\{ \begin{array}{ll} p_{x,y} &{}if \quad p_{x,y}^*=1\\ 1-p_{x,y} &{}otherwise ,\\ \end{array}\right. } \quad \alpha _{t}= {\left\{ \begin{array}{ll} \alpha &{}if \quad p_{x,y}^*=1 \\ 1-\alpha &{}otherwise \end{array}\right. } \end{aligned} \end{aligned}$$
(8)

where \(\alpha \) balances the importance of positive and negative samples, and \(\gamma \) adjusts the rate at which the weights of easy samples are reduced.

The regression loss \(L_{reg}\) in Eq. (6) is shown in Eq. (9):

$$\begin{aligned} L_{reg}(d_{x,y},d_{x,y}^*)=-ln\frac{Intersection(d_{x,y},d_{x,y}^*)}{Union(d_{x,y},d_{x,y}^*)} \end{aligned}$$
(9)

where \(Intersection(d_{x,y},d_{x,y}^*)\) and \(Union(d_{x,y},d_{x,y}^*)\) are the overlap area and the combined area of the predicted box and the ground-truth box, respectively.

The center-ness loss \(L_{center}\) in Eq. (6) is shown in Eq. (10):

$$\begin{aligned} \begin{aligned} L_{center}(center_{x,y},center_{x,y}^*)=-(center_{x,y}^*log(center_{x,y}) \\ +(1-center_{x,y}^*)log(1-center_{x,y})) \end{aligned} \end{aligned}$$
(10)

The mask loss \(L_{mask}\) in Eq. (6) is shown in Eq. (11):

$$\begin{aligned} L_{mask}=\sum _{x}-[s_{x}^*log(s_{x})+(1-s_{x}^*)log(1-s_{x})] \end{aligned}$$
(11)

where \(s_{x}\) is the predicted probability that the x-th pixel belongs to the target, and \(s_{x}^*\) is the corresponding ground-truth label of the x-th pixel.
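For reference, the following PyTorch sketch assembles the loss terms of Eqs. (6)-(11). The tensor layout, the assumption that IoU values are precomputed per location, and the default values of \(\lambda \), \(\beta \), \(\alpha \) and \(\gamma \) are illustrative choices, not values reported in this paper.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_p, cls_t, iou, ctr_p, ctr_t, mask_p, mask_t,
               lam=1.0, beta=1.0, alpha=0.25, gamma=2.0):
    """Sketch of the total loss of Eq. (6).

    cls_p/cls_t: per-location class probabilities and binary targets;
    iou: IoU between predicted and ground-truth boxes at each location
    (so Eq. (9) reduces to -log(iou)); ctr_*: center-ness predictions and
    targets; mask_*: per-pixel mask probabilities and labels. The default
    lam, beta, alpha, gamma are common values, not ones from this paper.
    """
    pos = cls_t == 1
    n_pos = pos.sum().clamp(min=1).float()

    # Eqs. (7)-(8): focal classification loss over all locations.
    p_t = torch.where(pos, cls_p, 1 - cls_p)
    a_t = torch.where(pos, alpha * torch.ones_like(cls_p),
                      (1 - alpha) * torch.ones_like(cls_p))
    l_cls = (-a_t * (1 - p_t) ** gamma * (p_t + 1e-6).log()).sum() / n_pos

    # Eq. (9): IoU regression loss, positive locations only.
    l_reg = (-(iou[pos] + 1e-6).log()).sum() / n_pos

    # Eq. (10): binary cross-entropy on center-ness, positives only.
    l_ctr = F.binary_cross_entropy(ctr_p[pos], ctr_t[pos],
                                   reduction='sum') / n_pos

    # Eq. (11): average binary cross-entropy over the mask pixels.
    l_mask = F.binary_cross_entropy(mask_p, mask_t, reduction='mean')

    return l_cls + lam * l_reg + beta * l_ctr + l_mask
```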

4 Experiment

4.1 Dataset and Evaluation Metrics

Apple Dataset Acquisition. In this paper, we use the open Fuji SFM dataset and modify it appropriately for our experiments. We select 400 suitable apple images from the 582 available images (each with a resolution of \(5184\times 3456\)), cut 15 patches with a resolution of \(1024\times 1024\) from each image, and obtain 6000 images of \(1024\times 1024\). From these 6000 images, we select 1400 suitable images as the final dataset. Finally, to give the network model high accuracy and robustness, we annotate the dataset manually with the mainstream image annotation tool labelme. Figure 1 shows some images from the dataset.
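A minimal sketch of this patch-extraction step is shown below. The evenly spaced 5\(\times \)3 grid (which yields exactly 15 patches per \(5184\times 3456\) image), the file glob and the naming scheme are assumptions; the paper states only that 15 patches of \(1024\times 1024\) were cut from each image.

```python
from pathlib import Path
from PIL import Image

def crop_patches(src_dir, dst_dir, patch=1024, cols=5, rows=3):
    """Cut cols * rows = 15 patches of patch x patch pixels from each
    source image. An evenly spaced 5 x 3 grid over a 5184 x 3456 image is
    an assumption; the paper states only that 15 patches of 1024 x 1024
    were cut per image. File glob and naming are likewise illustrative.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(src_dir).glob("*.jpg")):
        img = Image.open(img_path)
        w, h = img.size
        # evenly spaced top-left corners so the last tile touches the edge
        xs = [round(i * (w - patch) / (cols - 1)) for i in range(cols)]
        ys = [round(j * (h - patch) / (rows - 1)) for j in range(rows)]
        for j, y in enumerate(ys):
            for i, x in enumerate(xs):
                tile = img.crop((x, y, x + patch, y + patch))
                tile.save(dst / f"{img_path.stem}_{j}_{i}.png")
```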

Evaluation Metrics. We follow internationally unified measurement standards and use AP (average precision), \(AP_{50}\) (AP at an IoU threshold of 50%) and \(AP_{75}\) (AP at an IoU threshold of 75%) to measure the quality of the model.

4.2 Implementation Details

In the training stage, we trained for a total of 50 epochs with 200 steps each, with an initial learning rate of 0.01. Network parameters are optimized using adaptive moment estimation (Adam); the momentum and weight decay are 0.9 and 0.0001, respectively.
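A hedged sketch of this training configuration in PyTorch follows. Reading the reported momentum of 0.9 as Adam's first-moment coefficient \(\beta _1\) and the decay weight of 0.0001 as L2 weight decay are interpretations of the stated setup; `model` and `loader` stand for the network and data pipeline defined elsewhere.

```python
import torch

# `model` and `loader` are assumed to be the network and DataLoader
# defined elsewhere; treating momentum 0.9 as Adam's beta1 and the decay
# weight 0.0001 as L2 weight decay is an interpretation, not a quote.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
                             betas=(0.9, 0.999), weight_decay=0.0001)

for epoch in range(50):  # 50 epochs ...
    for step, (images, targets) in zip(range(200), loader):  # ... of 200 steps
        loss = model(images, targets)  # assumed to return the total loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```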

4.3 Comparative Experiments

In this study, apple fruits under different occlusion and lighting conditions were detected on a computer workstation. The detection results of Mask R-CNN, SOLO [27], PolarMask [28] and our method under these conditions were compared, and the performance of each algorithm was evaluated with AP, \(AP_{50}\) and \(AP_{75}\).

Comparison Experiment on Overlapping and Branch and Leaf Occlusion. In the natural environment, fruits may overlap or be covered by branches and leaves; the resulting loss of part of the fruit contour increases the difficulty of detection. Therefore, this study tested overlapping fruits and different degrees of branch and stem occlusion. The statistical results are shown in Fig. 6 and Table 1.

Fig. 6.

Detection results of the 4 algorithms on apples with different degrees of occlusion. (From left to right: Original Image, our Model, Mask R-CNN, SOLO and PolarMask)

Table 1. Experimental results of 4 algorithms for images with different occluded apples.

As can be seen from Table 1, the AP value of our algorithm exceeds that of Mask R-CNN by 4.9%, 5.1% and 4.6% across the test scenarios, and the AP values of the other two algorithms (SOLO and PolarMask) are lower than ours in both cases. From the comprehensive results, our algorithm is competent in detecting apples under different degrees of occlusion and overlap.

Fig. 7.

Detection results of the 4 algorithms on apples under different illumination. (From left to right: Original Image, our Model, Mask R-CNN, SOLO and PolarMask)

Table 2. Experimental results of 4 algorithms for apples under different illumination.

Comparison Experiment with Different Light. Under natural light, backlight and side light, the fruit appears brighter or darker. Since dense apple samples have a great influence on the results, they are not considered when selecting images here. The statistical results are shown in Fig. 7 and Table 2.

As can be seen from Table 2, the AP value of the improved model is higher than that of the other three algorithms in all three scenarios. Across scenes, the four algorithms perform best under side light and worst under backlight, because the texture of the apple is clear and the surface illumination uniform under side light, whereas backlight interferes with detection. Overall, our model adapts to the influence of lighting conditions on apple surface color, texture and contour, and can effectively detect apples in complex images.

4.4 Ablation Experiment

In this section, to clarify the impact of the image correction module (ICM) and the multi-scale enhancement fusion feature pyramid network (MSEF-FPN) on model performance, an ablation study is carried out and the role of each module is analyzed in detail. We gradually introduced our modules, tested the AP value of each combination, and obtained the experimental results shown in Table 3. The working mode of each module in the actual environment is discussed below.

As shown in Table 3, removing ICM reduces the AP of the model by 1.3%. This shows that the image correction module normalizes images under different lighting to a similar data distribution, which effectively increases the robustness of the model to complex environmental lighting. When MSEF-FPN is removed, the AP of the model decreases by 1.5%. This shows that MSEF-FPN enhances the feature extraction ability of the image and reduces the interference of complex backgrounds on apple detection results without increasing the amount of computation.

Table 3. The results of Ablation experiments.

5 Conclusion

In this paper, we propose a new instance segmentation method based on a dual attention-guided network for apple instance segmentation in complex environments, which overcomes the constraints imposed by illumination changes, occlusion and overlap, thereby enabling visual guidance for automatic picking. A CNN model with an image correction module and an instance segmentation module is constructed to meet the challenges of illumination, occlusion and overlap in complex environments. Experimental results show that the proposed algorithm achieves better instance segmentation performance than previous algorithms, which enables the model to be deployed on the detector of an apple-picking robot for automatic apple detection.