Introduction

To meet the nutrition and health needs of the growing world population, a major challenge for agricultural communities is to find innovative ways to increase the production of fruits and vegetables (Siegel et al., 2014), especially in the context of rising farming costs and a shortage of skilled labor. Efficient and sustainable agronomic management, which is required to reduce economic and environmental costs while increasing orchard productivity, is one effective way to alleviate this situation. Recently, advances in technologies such as robotics and computing have provided farmers with means to increase agricultural production in an efficient and sustainable way (Underwood et al., 2016). These new technologies have been widely applied to optimize agronomic management processes such as irrigation, fertilization, pruning, thinning and pest control (Bargoti & Underwood, 2017b; Cheein & Carelli, 2013). Through the detection and quantification of fruit distribution in the canopy, farmers can obtain valuable information and a reference for optimizing these processes, which will significantly facilitate the spatial and temporal management of agricultural production.

Among the many components required to realize efficient and sustainable agronomic management, the vision system is the most fundamental yet important one: it is used to parse the specified targets from complex and diverse scenes and has been widely used in many practical applications, such as crop yield estimation (Koirala et al., 2019a), growth monitoring (Fu et al., 2020b), intelligent picking (Bac et al., 2015) and disease detection (Zhang et al., 2019). Designing a vision system for rapid positioning and accurate segmentation significantly affects the real-time performance and reliability of these intelligent agriculture applications. However, there are many types of interference under natural conditions, such as varying scales, occlusions, overlaps and illuminations, especially against a monochromatic background; all of these are unfavorable to the vision system and need to be taken into consideration. Therefore, enhancing the discriminative ability of the vision system regardless of the above interference is crucial and necessary. In this paper, a robust segmentation network framework is specifically designed to segment overlapped apples from a monochromatic background, which is more challenging than previous works (Jia et al., 2020b; Zhang et al., 2016).

In recent years, many researchers have proposed different methods for improving the accuracy and robustness of detection models in complex orchard scenes. Some detection methods used colorspace transformations in which the objects of interest stand out, or extracted features such as shape and texture (Gongal et al., 2015; Jia et al., 2015; Kapach et al., 2012; Liu et al., 2016b; Zhou et al., 2012). In most of these solutions based on hand-crafted features, the discriminative information depends partly on the developers rather than entirely on the algorithms themselves, which may not be enough to deal with the level of variability and complexity that commonly appears in natural orchards. In addition, some scholars proposed computer vision solutions based on deep learning network architectures (Chen et al., 2017; Fu et al., 2020a; Jia et al., 2020a; Li et al., 2021; Vasconez et al., 2020). Although these methods can deeply mine the characteristics of targets by themselves, inconspicuous targets are easily disturbed by dominant salient objects, causing wrong judgments. This situation is even sharper when recognizing overlapped fruits against a monochromatic background, and still cannot meet the needs of real-world applications.

Based on the analysis of the above problems, the objective of this study, linking image processing with agronomic management, was to develop a model architecture that is robust enough to segment apples regardless of interference caused by sensors and natural orchard elements. The whole network framework can be divided into three parts: (1) Feature Acquisition, (2) RoIs (Regions of Interest) Generation and (3) Results Prediction. Firstly, the 'Feature Acquisition' pipeline consists of three steps, extraction, fusion and refining, which are respectively performed by a residual network (ResNet) (Targ et al., 2016), a feature pyramid network (FPN) (Kim et al., 2018) and a balanced feature pyramid (BFP) (Pang et al., 2019). The features of each image are extracted by ResNet and successively fused by FPN, so that the different scales caused by diverse factors (occlusion, camera distance and angle, etc.) are all well perceived. Subsequently, BFP strengthens the FPN features with an embedded Gaussian non-local attention mechanism, which can retain more semantic information of inconspicuous objects by selectively integrating similar features rather than simple contextual embedding. Then, at the 'RoIs Generation' stage, the region proposal network (RPN) (Ren et al., 2017) takes the features refined by BFP as input and outputs a set of rectangular object proposals on the original images, each with a score of belonging to the foreground. Afterwards, the RoI Align layer converts the features inside any valid region of interest into a small feature map with a fixed spatial extent of \({\text{H}}\times {\text{W}}\), where \({\text{H}}\) and \({\text{W}}\) represent the height and width of the RoIs respectively. The fixed-size RoIs are fed into the three branches of 'Results Prediction' for class probability, bounding box (bbox) regression and mask generation respectively. Finally, the outputs of the three branches are combined to obtain the final segmentation results.

It should be noted that the proposed method is more effective and flexible than previous network-based methods when dealing with complex and diverse scenes. Specifically, some fruits are inconspicuous or incomplete due to lighting and occlusion. If simple contextual embedding were used, the semantic information from dominant salient objects (e.g., leaves, branches) would harm the labeling of those inconspicuous objects near their edges. By contrast, the embedded Gaussian non-local attention module selectively aggregates the similar features of inconspicuous objects to highlight their feature representations and avoid the influence of salient objects. In addition, by explicitly taking spatial relationships into account, image understanding for segmentation can benefit from long-range dependencies built across the whole image. Compared with previously published work, the current work contributes to the development of a vision-system solution for agronomic management by examining the hypothesis that a Gaussian non-local attention mechanism can be easily embedded into a deep learning based vision model and effectively improve the accuracy and robustness of fruit detection by aggregating the similar features of inconspicuous objects across the image. In general, this study offers at least the following contributions:

  1. (I)

    A Gaussian non-local attention mechanism is embedded to focus on informative pixels while suppressing noise.

  2. (II)

    The proposed method outperforms state-of-the-art models in terms of both accuracy and robustness, making it more suitable for detecting fruits in complex scenes.

  3. (III)

    The work provides a valuable reference for the practical application of other fruit detection and segmentation methods.

The rest of this paper is organized as follows: Sect. 2 briefly outlines the breakthroughs of related works and unresolved issues. Sect. 3 introduces image acquisition and the related dataset processing and annotation. The detailed improvements of the model architecture and the whole network pipeline are illustrated in Sect. 4. In Sect. 5, experiments are presented to validate that the method outperforms others from different perspectives, including precision, recall and robustness. Finally, Sect. 6 summarizes the characteristics of the proposed method and elaborates the remaining unsolved problems in this field, which will be future research directions.

Related work

Designing a vision system for rapid positioning and accurate segmentation is a very challenging task because of the complicated and changeable situations in natural orchards. For example, occlusions and overlaps lead to incomplete shape features, and the angle and intensity of illumination lead to indistinct texture features. In the agricultural domain, earlier work used "classical" machine vision techniques, involving detection, classification and segmentation tasks based on hand-crafted features. For example, Ji used an SVM classifier to classify and recognize apples; the recognition rate of bagged apples reached 89%, but it took 352 ms to recognize an image, so the recognition efficiency was not high enough to meet real-time requirements (Ji et al., 2012). Tian proposed an optimized graph-based recognition algorithm utilizing depth images and paired RGB images without extra manual labeling, which improved both accuracy and speed, but there were obvious defects in the segmentation of overlapping and clustered apples (Tian et al., 2019a). Liu proposed a recognition method for bagged apples based on block classification: a watershed algorithm was employed to segment original images into irregular blocks, and an SVM then divided these blocks into fruit blocks and non-fruit blocks, which can efficiently restrain the interference of light (Liu et al., 2018). Rakun recognized apples by combining texture and color features, but the bags and drops on the apples would weaken or even change these features and make the apples difficult to recognize (Rakun et al., 2011). Bargoti and Underwood proposed a pipeline for mango and apple detection and counting. They used a general-purpose image segmentation approach with two feature learning algorithms, convolutional neural networks (CNN) and multiscale multilayered perceptrons (MLP), designed to include contextual information about how the image data were captured. Circular Hough transform (CHT) and watershed segmentation (WS) algorithms were then used to detect and count individual fruits from the pixel-wise fruit segmentation (Bargoti & Underwood, 2017a, 2017b). Linker proposed a yield prediction model specifically for night-time apple images; in addition, the classifier trained with images from one dataset was successfully applied to a second dataset, achieving the same prediction performance as the previous work (Linker, 2018). Hung demonstrated a generalised multi-scale feature learning approach to multi-class segmentation of tree crops. The segmentation results were applied to the problem of fruit counting and compared against manual counting, showing a squared correlation coefficient of R2 = 0.81 between the two (Hung et al., 2015). Similarly, other methods realize fruit detection by combining color, texture, shape and other features (Aggelopoulou et al., 2011; Kurtulmus et al., 2011; Wang et al., 2013). These 'classical' machine vision methods rely heavily on hand-crafted features to refine discriminative information, so they cannot comprehensively take enough aspects into account and are eliminated in complex real-world environments.

With the gradual maturity of deep learning, it has become prevalent to migrate this novel technology to various professions for better results, which has also stimulated the development of vision systems in the precision agriculture field. More recent works draw support from deep learning due to its various flavors and strong adaptability. Gené-Mola used RGB-D cameras to collect geometrical information along with color data and adapted a Faster R-CNN model for use with five-channel input images including color (RGB), depth (D) and a range-corrected intensity signal (S). Results show an improvement of 4.46% in F1-score when adding the depth and range-corrected intensity channels, from which it can be concluded that RGB-D sensors give valuable information for fruit detection (Gené-Mola et al., 2019). Li optimized U-Net with gated and atrous convolutions to make the model more suitable for small-apple segmentation in a monochromatic background, with a recognition time of 0.39 s. However, the optimized U-Net is still a semantic segmentation model, which can only achieve pixel-level classification rather than instance-level classification. As a result, overlapping and clustered fruits are merged into one area, which is not suitable for fruit counting and picking (Li et al., 2021). Koirala compared six existing deep learning architectures for the task of mango detection and developed a new architecture named MangoYOLO based on features of YOLOv3 and YOLOv2(tiny) under the design criteria of accuracy and speed. MangoYOLO achieves an F1 score of 0.968 with a detection speed of 8 ms, realizing a good trade-off between speed and accuracy (Koirala et al., 2019b). Tian employed an improved YOLOv3 architecture to detect apples at different growth stages: images of young, expanding and ripe apples were collected and subsequently augmented, and these augmented images were fed into DenseNet for feature extraction (Tian et al., 2019b). Sa adapted a Faster R-CNN object detector through transfer learning with images obtained from two modalities, color (RGB) and near-infrared (NIR). However, this model used VGG16 as the backbone of the whole pipeline to extract features, and the large capacity of the model (more than 500 megabytes) may make it difficult to deploy on mobile agricultural devices (Sa et al., 2016). Rahnemoonfar and Sheppard trained a CNN using features at multiple scales based on an Inception-ResNet architecture. The model was trained with synthetic images and tested on real images of tomato plants, reaching 91% accuracy. However, the network was tested using 128 × 128 pixel images, which may miss important features of the fruit due to the low resolution (Rahnemoonfar & Sheppard, 2017). Vasconez tested the effect of the two most common CNN detectors (Faster R-CNN with Inception and SSD with MobileNet) on fruit detection and compared the results of the two models on three fruit datasets (avocado, apple and lemon). Extensive experiments provide an insightful analysis of the usability of such techniques in fruit counting tasks in groves, which can further improve the decision-making process in agricultural practices (Vasconez et al., 2020). Although improved networks enhanced feature propagation and reusability, the reported data show that these algorithms are still easily affected by occluded objects.

Although much research in this field has made breakthroughs in various ways, the above methods can only realize either detection or segmentation. Mask R-CNN (He et al., 2017; Wei et al., 2018) provides a framework for predicting both the bounding box and the pixel mask of each object by simply adding a mask branch to Faster R-CNN (Ren et al., 2017), which can efficiently eliminate interference caused by overlaps and occlusions. A considerable amount of research on Mask R-CNN based methods is under way and has made progress. Jia improved the backbone of Mask R-CNN by combining ResNet with DenseNet, which greatly reduces the input parameters and efficiently strengthens feature extraction (Jia et al., 2020b). Yu applied Mask R-CNN to detect strawberries and obtained good results in terms of both robustness and universality, particularly for inconspicuous fruits (Yu et al., 2019). In addition, many works on attention mechanisms (Chen et al., 2016; Fu et al., 2019a, 2019b; Wang et al., 2018) have also made great breakthroughs, which provide references for detecting overlapped apples against a monochromatic background. Inspired by these two innovations and with the goal of segmenting occluded apples in a monochromatic background, the current work proposes an improved model based on Mask R-CNN that embeds a Gaussian non-local attention mechanism to better focus on informative pixels while suppressing noise.

Dataset generation

Image acquisition

The images were acquired at the Longwangshan apple production base in Fushan District, Yantai City, Shandong Province (the agricultural information technology experimental base of Shandong Normal University) with a 6000 × 4000 pixel resolution. Because the performance of deep learning models relies heavily on datasets, and in order to capture the diversity of overlapped fruit and minimize variability in lighting conditions due to direct sunlight or cloud cover, images were taken from multiple directions and at multiple time intervals (morning, noon and evening). In total, 268 apple images with different illuminations and different amounts of fruit were collected. Moreover, since the detection of overlapped apples in a monochromatic background is the research object of this paper, the collected images contain a large proportion of apples occluded by leaves and branches or overlapped by other fruits, as illustrated in the first two rows of Fig. 1.

Fig. 1

a–d represent the images taken at different time intervals and with different illumination angles; e and f show different types of occlusion (inter-fruit overlap, leaf occlusion, branch occlusion and their combination); i is the ground truth corresponding to this image, and j–l are segmented by Mask R-CNN equipped with the same weights and configurations. Apparently, as the fog gets thicker, the segmentation effect gets worse, which commonly appears in most segmentation methods. m–p show four of the six corruption types (Gaussian noise, impulse noise, brightness, fog, snow and contrast) at middle severity

Image annotation and dataset production

Although the aforementioned factors were taken into account when collecting photos, some studies (Dodge and Karam, 2016; Michaelis et al., 2019) have shown that most standard detection and segmentation models suffer a serious performance loss when images get corrupted (down to 30–60% of the original), which is inevitably caused by sensor degradation or poor weather and is extremely unfavorable for real-world applications of agronomic management. For example, a state-of-the-art segmentation algorithm such as Mask R-CNN fails to segment some apples when the fog gets thicker (as shown in the third row of Fig. 1), even though the apples are still clearly visible to human eyes, which means that the vision system would be a poor substitute for manual labor if the robustness of the models cannot be improved. Considering that the ability of the model to detect apples regardless of image distortions is also crucial for real-world agronomic management, several data augmentation modes were employed to mitigate the severe performance degradation usually caused by degraded hardware or poor weather in actual applications. Furthermore, in order to increase the network's generalization capability and reduce the probability of overfitting, the training set was corrupted with six image distortions, each spanning three levels of severity, as data augmentation (as shown in the last row of Fig. 1). In addition, to better adapt the trained algorithm to target recognition at low resolution, the original images were cropped to 4000 × 4000 pixels and further downscaled to 512 × 512 pixels. Finally, after filtering and cropping, the 268 images were manually annotated using the labelme annotation tool. Without any additional labelling cost or architecture changes, 2813 images containing 5831 apples were generated for model training and 1000 images containing 2142 apples were generated for model evaluation. More detailed dataset information is shown in Table 1. It should also be noted that the training process drew support from transfer learning by migrating pretrained model weights to the RS-Net architecture before formal training; the pretrained weights were obtained by extracting 1586 images containing 5851 apples from MS COCO (Lin et al., 2014) and training the RS-Net model on them. Through pre-training on these extracted images, the model converges faster and achieves better performance.
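To make the augmentation step concrete, the following is a minimal NumPy/Pillow sketch of how a corruption of a chosen type and severity could be applied to a cropped and downscaled image. The corruption functions and severity values shown here are illustrative assumptions, not the exact implementations used to build the dataset.

```python
import numpy as np
from PIL import Image

def corrupt(image, corruption="gaussian_noise", severity=2):
    """Apply a simple photometric corruption at severity 1-3 to an RGB uint8 array.
    The corruption parameters below are illustrative only."""
    img = image.astype(np.float32) / 255.0
    if corruption == "gaussian_noise":
        sigma = [0.04, 0.08, 0.12][severity - 1]
        img = img + np.random.normal(0.0, sigma, img.shape)
    elif corruption == "brightness":
        img = img + [0.1, 0.2, 0.3][severity - 1]
    elif corruption == "contrast":
        factor = [0.75, 0.5, 0.3][severity - 1]
        img = (img - img.mean()) * factor + img.mean()
    return (np.clip(img, 0.0, 1.0) * 255).astype(np.uint8)

# Crop the 6000 x 4000 capture to a central 4000 x 4000 square, downscale to 512 x 512,
# then apply one corruption type as augmentation.
raw = np.random.randint(0, 256, (4000, 6000, 3), dtype=np.uint8)  # stand-in for a real photo
square = raw[:, 1000:5000]
small = np.array(Image.fromarray(square).resize((512, 512)))
augmented = corrupt(small, "gaussian_noise", severity=2)
```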

Table 1 Image acquisition and data set division

RS-Net

Mask R-CNN is a state-of-the-art instance segmentation algorithm which extends many previous excellent works (Shelhamer et al., 2017). It efficiently detects objects while simultaneously generating a high-quality segmentation mask for each instance in an image. In this paper, RS-Net extends the original Mask R-CNN to make it more suitable for the segmentation of overlapped fruits in complex scenes. The overall pipeline of RS-Net is shown in Fig. 2. It consists of three parts: (1) Feature Acquisition, (2) RoIs Generation, and (3) Results Prediction. Firstly, the 'Feature Acquisition' pipeline consists of three steps, extraction, fusion and refining, which are respectively performed by ResNet, FPN and BFP (specifically in Fig. 3). Then, based on the features generated by BFP, the RPN produces abundant anchors on the original images and outputs a set of initially filtered object proposals. Finally, the mask is generated by an FCN to indicate the detailed area where the apples are located.

Fig. 2

Overview of the improved Mask R-CNN: an overall pipeline design for apple segmentation consisting of three parts: (1) Feature Acquisition, (2) RoIs Generation, and (3) Results Prediction

Fig. 3

Overall pipeline of the 'Feature Acquisition' section. Images are processed continuously as above to get the final finer feature maps (P2–P5) for the next steps. In this figure, feature maps are indicated by different color outlines, and thicker outlines denote semantically stronger features. The detailed pipeline of the 'Attention Module' is illustrated in Fig. 4

The goal of RS-Net is to focus on informative pixels while suppressing noise by selectively aggregating the similar features of inconspicuous fruits, thus exploiting the potential of the proposed architecture for vision systems in agronomic management as much as possible. All components are detailed in the following sections.

Feature acquisition (ResNet + FPN + BFP)

The overall pipeline of 'Feature Acquisition' is shown in Fig. 3. This section can be divided into three parts, extraction, fusion and refining, which are respectively performed by ResNet, FPN and BFP. Specifically, the combination of ResNet and FPN has been widely applied in many detection and segmentation architectures due to its excellent feature representation, which also fits the research goal of this paper. Generally, the depth of the network is crucial for learning features with stronger representation ability, but increasing the network depth brings problems such as vanishing and exploding gradients, which lead to model degradation. To address these problems, ResNet explicitly lets shallower and deeper layers fit a residual mapping, thus improving the discriminative ability of networks with deeper layers. Owing to the efficient feature extraction ability of ResNet, RS-Net can better represent the image features on the basis of the labeling information.

Generally, the output of the last layer of ResNet provides sufficient semantic information, but at the cost of losing detailed information related to object boundaries and resolution due to the consecutive down-sampling operations (convolution and pooling). This seriously dilutes the semantic information of smaller objects and finally causes detection to fail. Considering that the vision system in agronomic management also needs to accurately recognize apples occupying small areas of an image, owing to the distance between sensors and objects, FPN is introduced into the RS-Net architecture. Typically, deep high-level features in backbones carry more semantic meaning while shallow low-level features are more content descriptive; in other words, low-level and high-level information are complementary in terms of semantic meaning and content detail. Based on this point, FPN develops a top-down architecture with lateral connections to build high-level semantic feature maps at all scales, thereby improving the final accuracy on small objects. The details are shown on the left of Fig. 3. Specifically, FPN uses the feature activations produced by each stage's last residual block of ResNet and denotes the outputs of these last residual blocks as \(\left\{ {F_{2} ,F_{3} ,F_{4} ,F_{5} } \right\}\) for the conv2, conv3, conv4 and conv5 stages. The set of feature maps integrated by FPN is called \(\left\{ {A_{2} ,A_{3} ,A_{4} ,A_{5} } \right\}\), corresponding to \(\left\{ {F_{2} ,F_{3} ,F_{4} ,F_{5} } \right\}\), which are respectively of the same spatial sizes.
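The top-down lateral fusion that produces \(A_{2}\)–\(A_{5}\) can be sketched as below. This is a minimal PyTorch illustration assuming ResNet50-style channel widths, not the exact FPN configuration of RS-Net.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN-style top-down fusion over ResNet stage outputs F2-F5."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions align every stage to a common channel width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions smooth the merged maps.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels)

    def forward(self, feats):  # feats = [F2, F3, F4, F5]; F2 has the largest resolution
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample the coarser map and add it to the finer lateral map.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [conv(l) for conv, l in zip(self.smooth, laterals)]  # [A2, A3, A4, A5]
```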

Normally, the features produced by ResNet and FPN could serve as the basis for detection and segmentation, but two important factors motivate introducing the BFP module to further refine the extracted features. The first is that a large percentage of apples in the collected images are inconspicuous or incomplete due to adverse factors such as lighting, occlusion and overlap, so the semantic information of inconspicuous fruits is easily disturbed by dominant salient objects (e.g., leaves, branches) and diluted by consecutive down-sampling operations. The second is that some studies reveal that the best integrated features should possess balanced information from each resolution, whereas the sequential manner of FPN makes the integrated features focus more on adjacent resolutions and less on others; the semantic information contained in non-adjacent levels is diluted once per fusion during the information flow. Therefore, in order to relieve these two dilemmas simultaneously, the BFP module is introduced into the model architecture, as illustrated on the right of Fig. 3 and detailed in Fig. 4.

Fig. 4

Detailed description of the attention module which is illustrated in the BFP section of Fig. 3

The features at level \(l\) and the number of feature levels generated by FPN are denoted as \(A_{l}\) and \(L\) respectively. The indexes of the smallest and biggest involved levels are denoted as \(l_{\min }\) and \(l_{\max }\). In Fig. 3, \(A_{2}\) has the biggest resolution. BFP first rescales the features \(\left\{ {A_{2} ,A_{3} ,A_{4} ,A_{5} } \right\}\) to one intermediate size, that of \(A_{4}\), using interpolation or adaptive max-pooling respectively. Then the balanced semantic features are obtained by simply averaging as:

$$A = \frac{1}{L}\sum\limits_{l = l_{\min }}^{l_{\max }} {A_{l} }$$

Through this simple procedure, shown as Eq. (1), each feature level contains equal information from the others via resizing and averaging, without any extra parameters. Next, the balanced semantic feature \(A\) is further refined to be more discriminative by an embedded Gaussian non-local operation. Firstly, a general formula for the non-local operation is defined as Eq. (2):

$$E_{i} = \frac{1}{C\left( A \right)}\sum\limits_{\forall j} {f\left( {A_{i} ,A_{j} } \right)g\left( {A_{j} } \right)}$$

Here \(A \in R^{C \times H \times W}\) is the balanced semantic feature map, \(i\) denotes the position index whose similarity map will be computed, and \(j\) denotes the index that enumerates all positions of \(A\). \(f\) is the pairwise function that computes a scalar representing the relationship between \(i\) and all \(j\), and \(C(A)\) is a normalization factor. \(E\) is the output signal, with the same spatial size as \(A\). The unary function \(g\) computes a representation of \(A\) at position \(j\); for simplicity, \(g\) is only considered in the form of a linear embedding, \(g\left( {A_{j} } \right) = D_{j} = W_{g} A_{j}\), where \(W_{g}\) is a weight matrix to be learned, implemented with a 1 × 1 convolution. As for the pairwise function, BFP employs the embedded Gaussian function to compute the similarity.

Specifically, the non-local operation first feeds \(A\) into 1 × 1 convolution layers (\(\theta\) and \(\varphi\)) to generate two new feature maps \(B\) and \(C\) respectively, where \(B, C \in R^{C \times H \times W}\). It then reshapes them to \(R^{C \times N}\), where \(N = H \times W\) is the number of pixels. After that, BFP performs a matrix multiplication between the transpose of \(B\) and \(C\), and applies a softmax layer to calculate the correlation intensity matrix \(S \in R^{N \times N}\) between any two positions:

$$s_{ij} = \frac{\exp \left( {B_{i} \cdot C_{j} } \right)}{{\sum\nolimits_{j = 1}^{N} {\exp \left( {B_{i} \cdot C_{j} } \right)} }}$$

where \(s_{ij}\) measures the relationship between the \(i^{th}\) position and the \(j^{th}\) position. The more similar the feature representations of the two positions are, the greater the correlation between them.

Meanwhile, the non-local operation also feeds the feature \(A\) into another convolution layer \(g\) to generate a new feature map \(D \in R^{C \times H \times W}\) and reshapes it to \(R^{C \times N}\). The non-local operation then performs a matrix multiplication between \(D\) and the transpose of \(S\) and reshapes the result to \(R^{C \times H \times W}\). Finally, the non-local operation performs an element-wise sum with the feature \(A\) to obtain the final output \(E \in R^{C \times H \times W}\) as follows:

$$E_{i} = \sum\limits_{j = 1}^{N} {\left( {s_{ij} \cdot D_{j} } \right)} + A_{i}$$

It can be inferred from Eq. (4) that the resulting feature \(E\) at each position is a weighted sum of the features across all positions plus the original features. Therefore, it has a global contextual view and selectively aggregates contexts according to the correlation intensity matrix \(S\). Similar semantic features achieve mutual gains, thus enhancing similar semantic information while suppressing noise.
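The rescale-and-average step of Eq. (1) and the embedded Gaussian non-local refinement of Eqs. (2)–(4) can be summarized in the following minimal PyTorch sketch. The module and layer names (theta, phi, g), the 256-channel width and the residual redistribution to each level are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedNonLocal(nn.Module):
    """Sketch of BFP: rescale the FPN levels to one intermediate size, average
    them (Eq. 1), refine the average with embedded Gaussian self-attention
    (Eqs. 2-4), then redistribute the refined map back to every level."""

    def __init__(self, channels=256):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, 1)  # produces B
        self.phi = nn.Conv2d(channels, channels, 1)    # produces C
        self.g = nn.Conv2d(channels, channels, 1)      # produces D

    def forward(self, feats):                  # feats = [A2, A3, A4, A5], A2 largest
        ref_size = feats[2].shape[-2:]         # intermediate size: that of A4
        gathered = [F.adaptive_max_pool2d(f, ref_size) if f.shape[-2] > ref_size[0]
                    else F.interpolate(f, size=ref_size, mode="nearest") for f in feats]
        A = torch.stack(gathered).mean(dim=0)  # Eq. (1): balanced semantic feature

        n, c, h, w = A.shape
        B = self.theta(A).view(n, c, h * w)    # (n, C, N) with N = H * W
        C = self.phi(A).view(n, c, h * w)
        D = self.g(A).view(n, c, h * w)
        S = torch.softmax(torch.bmm(B.transpose(1, 2), C), dim=-1)  # Eq. (3): (n, N, N)
        E = torch.bmm(D, S.transpose(1, 2)).view(n, c, h, w) + A    # Eq. (4)

        # Scatter the refined map back to each pyramid level as a residual.
        return [f + F.interpolate(E, size=f.shape[-2:], mode="nearest") for f in feats]
```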

RoIs generation

Each feature map Pi in {P2, P3, P4, P5} generated by the last stage is input into the RPN (Fig. 5) to generate abundant anchors of different shapes, so as to match as far as possible the different apple shapes caused by overlaps and occlusions. The RPN then initially filters the generated anchors according to the probability of being foreground. The architecture of the RPN consists of just one 3 × 3 convolutional layer followed by two 1 × 1 convolutional layers (for regression and classification, denoted as reg and cls respectively), which is nearly cost-free given the detection network computation. Concretely, the 3 × 3 convolutional layer can be seen as a sliding window traversing all points of Pi; at each sliding-window center location, the RPN simultaneously predicts multiple region anchors on the original images. Considering that FPN has been adopted to alleviate scale variation, the RPN only employs a single area scale of 8 × 8 with three aspect ratios (1:2, 1:1, 2:1) for each feature map level. For a convolutional feature map of size W × H, there are 3 × W × H anchors in total. Subsequently, cls is responsible for predicting the probability of each anchor being foreground, and reg is responsible for predicting a 4-D vector representing the four parameterized coordinates of the predicted bounding box for each anchor. Finally, Non-Maximum Suppression (NMS) is applied to filter out some anchors based on the confidence scores predicted by cls and the bbox offsets predicted by reg. The remaining anchors, called 'proposals', are the outputs of the RPN. The embedding of the RPN costs only two extra convolutional layers but plays an important role in the whole network structure.

Fig. 5

Detailed description of the RPN. 256-D represents a 256-dimensional vector obtained after the 3 × 3 convolution at each spatial location in the feature map
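A minimal PyTorch sketch of the RPN head just described (one 3 × 3 convolution followed by sibling cls and reg 1 × 1 convolutions) is given below; anchor generation, label assignment and NMS are omitted, and while the 256-channel width follows Fig. 5, the remaining details are illustrative assumptions.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """One 3x3 convolution followed by two sibling 1x1 convolutions (cls and reg).
    With one scale and three aspect ratios there are 3 anchors per location."""

    def __init__(self, in_channels=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, 3, padding=1)
        self.cls = nn.Conv2d(256, num_anchors, 1)      # foreground score per anchor
        self.reg = nn.Conv2d(256, num_anchors * 4, 1)  # 4 parameterized box offsets per anchor
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feature):          # feature: one pyramid level Pi, shape (N, 256, H, W)
        x = self.relu(self.conv(feature))
        return self.cls(x), self.reg(x)  # (N, 3, H, W) scores, (N, 12, H, W) offsets
```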

Since the proposals are generated on the original images, the model must map them to the corresponding level to obtain the features inside the proposals, which are called Regions of Interest (RoIs). Because there are multiple feature maps owing to FPN, the RoI Align layer needs to assign proposals of different scales to a certain pyramid level. Formally, a proposal with width w and height h is assigned to level Pk of the feature pyramid by:

$$k = \left\lfloor {k_{0} + \log_{2} \left( {\sqrt {wh} /512} \right)} \right\rfloor$$

Here 512 is the uniform image size, and \(k_{0}\) is the target level onto which a proposal with \(w \times h = 512 \times 512\) should be mapped. Intuitively, Eq. (5) means that if the area of a proposal becomes bigger, it should be mapped onto a coarser-resolution level. Next, the RoIs are fed into the RoI Align layer, improved from spatial pyramid pooling (SPP), to stretch them to the same scale; this removes the harsh quantization of RoIPool and plays a key role in the subsequent mask prediction.
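The level assignment of Eq. (5) is simple to compute directly; the sketch below assumes \(k_{0} = 4\) (a full 512 × 512 proposal maps to \(P_{4}\)) and clamps the result to the available levels \(P_{2}\)–\(P_{5}\), both of which are illustrative assumptions.

```python
import math

def map_proposal_to_level(w, h, k0=4, k_min=2, k_max=5, image_size=512):
    """Assign a proposal of width w and height h to pyramid level P_k via Eq. (5);
    larger proposals are routed to coarser levels, clamped to [k_min, k_max]."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / image_size))
    return max(k_min, min(k_max, k))

# Example: a small 128 x 96 proposal goes to the finest level, a full-image one to P4.
print(map_proposal_to_level(128, 96))    # -> 2
print(map_proposal_to_level(512, 512))   # -> 4
```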

Results prediction

RS-Net has three sibling output branches with different tasks for the final predictions. The first outputs a probability distribution (per RoI) of being an apple. Although only one category needs to be identified in the current work, the comprehensive evaluation metric AP, explained below, needs the probability value to calculate precision and recall over each intersection-over-union (IoU) threshold, so the model retains this branch for evaluation and intuitive comparison with other methods. The second sibling layer outputs bounding-box regression offsets for adjusting the proposals. Finally, the third branch employs a Fully Convolutional Network (FCN) on each RoI to achieve the instance segmentation task. Specifically, this branch predicts an m × m mask from each RoI using an FCN without collapsing it into a vector representation that lacks spatial dimensions, making a pixel-wise prediction for each point in the RoI through successive up- and down-sampling. By combining the predictions of the three sibling branches, the final segmentation targets are obtained.
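As an illustration of the mask branch, the sketch below follows the common Mask R-CNN layout of a few 3 × 3 convolutions, one up-sampling step and a 1 × 1 prediction layer; the specific depth, channel width and 14 × 14 / 28 × 28 sizes are assumptions, not necessarily the configuration used in RS-Net.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskBranch(nn.Module):
    """FCN-style mask branch operating on RoI-aligned features."""

    def __init__(self, in_channels=256, num_classes=1):
        super().__init__()
        layers = []
        for _ in range(4):  # stacked 3x3 convolutions preserve the spatial layout
            layers += [nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True)]
            in_channels = 256
        self.convs = nn.Sequential(*layers)
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)  # e.g. 14x14 -> 28x28
        self.predict = nn.Conv2d(256, num_classes, 1)               # per-pixel mask logits

    def forward(self, roi_features):      # (num_rois, 256, 14, 14) from RoI Align
        x = self.convs(roi_features)
        x = F.relu(self.upsample(x))
        return self.predict(x)            # (num_rois, num_classes, 28, 28)
```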

Implementation details

Many hyper-parameters are needed in the implementation, and the results are sensitive to their settings, so these hyper-parameters were tuned empirically by trial and error for better segmentation performance.

In the training phase, the whole architecture can be trained end-to-end by stochastic gradient descent (SGD) and back propagation. Images are first normalized with mean = [0.50, 0.42, 0.34] and std = [0.28, 0.27, 0.28], which are calculated from the training dataset. ResNet50 is used as the main backbone because it is publicly available and reduces the running time. Each iteration uses a batch of 2 images, and batch normalization (BN) is applied while updating the weights. The initial learning rate, momentum and weight decay are set to 0.0025, 0.9 and 0.0001 respectively, and the learning rate is decreased by a factor of 0.1 after 8 and 11 epochs unless otherwise noted. The base anchor scale and aspect ratios are set to 8 and [0.5, 1, 2] when training the RPN. The overall loss function can be divided into two parts: the classification and offset regression losses from the RPN section, and the multi-task losses from the 'Results Prediction' section, which include the classification branch, the coordinate regression branch and the mask segmentation branch, as shown below:

$$\begin{aligned} L_{final} & = L_{RPN} + L_{Results - Prediction} \\ & = L_{cls1} + L_{reg1} + L_{cls2} + L_{reg2} + L_{mask} \\ \end{aligned}$$

Here \(L_{final}\) denotes the final loss used for back propagation; \(L_{RPN}\), which consists of \(L_{cls1}\) and \(L_{reg1}\), and \(L_{Results - Prediction}\), which consists of \(L_{cls2}\), \(L_{reg2}\) and \(L_{mask}\), represent the losses from the RPN and Results Prediction sections respectively. Specifically, the model employs the cross-entropy loss function for \(L_{cls1}\), \(L_{cls2}\) and \(L_{mask}\), and the L1 loss function for \(L_{reg1}\) and \(L_{reg2}\). For each feature level generated by BFP, 256 anchors are randomly sampled as a mini-batch for computing \(L_{RPN}\), with negative and positive anchors sampled at a 1:1 ratio. If there are fewer than 128 positive samples in an image, the batch is padded with negative ones.
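A minimal sketch of how the total loss of Eq. (6) and the reported optimizer settings could be assembled is given below; the function signature and tensor arguments are illustrative, and the mask loss is written as the per-pixel binary cross entropy that Mask R-CNN-style models typically use.

```python
import torch
import torch.nn.functional as F

def rs_net_loss(rpn_cls_logits, rpn_labels, rpn_deltas, rpn_targets,
                cls_logits, cls_labels, box_deltas, box_targets,
                mask_logits, mask_targets):
    """Total loss of Eq. (6): RPN losses plus the three 'Results Prediction' losses."""
    l_cls1 = F.cross_entropy(rpn_cls_logits, rpn_labels)   # RPN objectness
    l_reg1 = F.l1_loss(rpn_deltas, rpn_targets)            # RPN box offsets
    l_cls2 = F.cross_entropy(cls_logits, cls_labels)       # RoI classification
    l_reg2 = F.l1_loss(box_deltas, box_targets)            # RoI box refinement
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)  # per-pixel mask
    return l_cls1 + l_reg1 + l_cls2 + l_reg2 + l_mask

# Optimizer and schedule matching the reported hyper-parameters (model is a placeholder).
# optimizer = torch.optim.SGD(model.parameters(), lr=0.0025, momentum=0.9, weight_decay=0.0001)
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)
```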

Experiments

Evaluation metric

In order to evaluate detection performance comprehensively and strictly, \(AP\) (average precision) is employed as the main evaluation metric, averaging the precision values calculated over IoUs from 0.5 to 0.95 with an interval of 0.05. Firstly, define \(I\) as a set of equally spaced IoU threshold levels [0.5, 0.55, …, 0.95]. For each threshold \(i\) in \(I\), if the IoU between a predicted bbox and the matched ground truth exceeds \(i\), this example is defined as a true positive (TP); otherwise it is a false positive (FP), and ground truths that are not successfully detected are defined as false negatives (FN). Then, at most the top 100 predicted bboxes by confidence score are selected and used to calculate precision (P) and recall (R) (Eq. 7) pairs corresponding to the sorted confidence thresholds in turn, so that the precision/recall pairs over a specific IoU threshold and multiple confidence thresholds are obtained.

$$P = \frac{TP}{{TP + FP}}\quad R = \frac{TP}{{TP + FN}}$$

AP over a specific IoU threshold \(i\) could be seen as the approximate area under the precision/recall curve (\(AUC\)), and is defined as the mean precision at a set of 101 equally spaced recall levels \(R\): [0, 0.01, ..., 1]:

$$AP^{IoU = i} = \frac{1}{101}\sum\limits_{r \in R} {p_{interp} \left( r \right)}$$

The precision at each recall level \(r\) is interpolated by taking the maximum precision measured at any recall greater than or equal to \(r\):

$$p_{interp} \left( r \right) = \mathop {\max }\limits_{{\tilde{r}:\tilde{r} \ge r}} p\left( {\tilde{r}} \right)$$

where \(p\left( {\tilde{r}} \right)\) is the measured precision at recall \(\tilde{r}\). Similarly, all \(AP^{IoU = i} \left( {i \in I} \right)\) can be obtained by following the above steps, and the final evaluation metric \(AP\) can be formulated as:

$$AP = \frac{1}{10}\sum\limits_{i \in I} {AP^{IoU = i} }$$

The factor 10 corresponds to the number of IoU thresholds tested in the set \(I\). Intuitively, AP evaluates the result over different IoU thresholds, confidence scores, precisions and recalls, and thus can measure RS-Net accurately and comprehensively. Both box AP and mask AP are evaluated. In addition, AR (average recall) is also used as an evaluation metric, obtained by averaging \(AR^{IoU = i}\) over the 10 tested IoU thresholds given at most the top 100 predicted bboxes. Since the model only needs to identify one category, \(AR^{IoU = i}\) under a specific threshold is equal to R in Eq. (7). For more information about the evaluation metrics, please refer to MS COCO.
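For illustration, the interpolation and averaging of Eqs. (8)–(10) can be computed as in the short sketch below, assuming the precision/recall pairs obtained by sweeping the confidence threshold at one IoU level are already available.

```python
import numpy as np

def average_precision(precisions, recalls):
    """AP at one IoU threshold (Eqs. 8-9): interpolate precision as the maximum
    precision at any recall >= r, then average over 101 equally spaced recall levels."""
    precisions = np.asarray(precisions, dtype=float)
    recalls = np.asarray(recalls, dtype=float)
    interp = []
    for r in np.linspace(0.0, 1.0, 101):
        mask = recalls >= r
        interp.append(precisions[mask].max() if mask.any() else 0.0)  # p_interp(r)
    return float(np.mean(interp))

# Final AP (Eq. 10): average the per-threshold APs over IoU in {0.50, 0.55, ..., 0.95}.
# ap = np.mean([average_precision(p_at[i], r_at[i]) for i in range(10)])
```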

Model training

In total, 2813 images containing 5831 apples are used for the training process. RS-Net is trained for 12 epochs with a total of 16,884 iterations (2 images/iteration). In addition, although the dataset is extended with different corruptions, there is only one category, which makes the training process prone to overfitting. To eliminate this hidden trouble and accelerate network convergence, RS-Net is pretrained on 1586 images extracted from the MS COCO dataset without extra annotation work, and the pretrained weights are then loaded into the model architecture as initialization parameters for the formal training. The loss value curves over iterations for the above two situations are illustrated in Fig. 6.

Fig. 6

Loss value curves over iterations. The thicker curve represents the training process equipped with pretrained weights and the thinner curve represents no pre-training

Obviously, the thicker curve begins at a remarkably smaller value than the thinner one, and the loss value is about 0.1 smaller when the two curves stabilize at the end. It can be inferred from this figure that the formal training of the model benefits from the 1586 images used for pre-training, which makes the model learn more distinguishing features and lowers the risk of over-fitting. Given the obvious gap between the two results, pre-training is adopted in the following processes for better performance.

Ablation experiments

For a fair comparison in the ablation experiments and to validate the effect of the attention module, the experiment employs the original Mask R-CNN built on MMDetection v2.0 (Chen et al., 2018) as the baseline, with the same hyper-parameters as RS-Net except for the BFP section. Since Mask R-CNN and RS-Net both have a relatively good segmentation effect, the experiment directly employs IoU = 0.90 as a strict threshold for defining a bbox as TP or FP, so as to measure the high-quality effect gap between the baseline and RS-Net. Table 2 lists the specific comparison results of the two methods.

Table 2 Specific comparison results of two methods

As shown in Table 2, in addition to average precision, RS-Net also enhances the average recall of the predicted bboxes. The embedding of the attention module brings 3.7 points higher \(AR_{{90}}^{{box}}\) and 2.9 points higher \(AR_{{90}}^{{mask}}\) compared with the baseline method. This is because BFP makes similar semantic features achieve mutual gains across inconspicuous and salient apples: heavily inconspicuous apples caused by overlap, occlusion and illumination can draw information from salient apples, so the proportion of boxes judged as TP among all ground truths rises and the AR metrics are significantly improved. Several images containing heavily incomplete apples are visualized in Fig. 7 to intuitively show the gap between the two methods.

Fig. 7

Visualization of comparison results for the two methods. Ellipses mark apples that were labeled as ground truth and that the baseline method failed to detect but RS-Net detected. Circles mark apples that were not labeled as ground truth due to severe occlusion but that RS-Net still detected

As shown in the above figure, both methods have a good segmentation effect when detecting conspicuous apples. However, owing to the attention mechanism employed in RS-Net, severely occluded apples can also be well segmented by drawing information from salient parts, including severely occluded apples that are not even labeled as ground truth. This is also the reason why RS-Net obtains higher metric values. Therefore, RS-Net is better at segmenting overlapped apples against a background of the same color and is more suitable for deployment on the vision system of a harvesting robot.

Comparison with state-of-the-arts methods

For further validation of the improved Mask R-CNN, experiments compare the proposed model with state-of-the-art detection and instance segmentation methods under identical experimental configurations. It should be noted that all experiments reported in the current work are run in the same environment, equipped with a Tesla V100 GPU, CUDA 10.0 and PyTorch 1.4.

Detection effect

Since the main body of RS-Net extends a detector architecture by adding a mask branch, and the mask segmentation operates on the predicted boxes, the detection effect of the model directly affects the segmentation effect; therefore, experiments first compare the detection effect of RS-Net with state-of-the-art detectors. As for evaluation metrics, in addition to the box Average Precision (APbox) metric, which averages APs across IoU thresholds from 0.5 to 0.95 with an interval of 0.05, \(AP_{{50}}^{{box}}\) and \(AP_{{90}}^{{box}}\) (AP at different IoU thresholds) are reported as loose and strict boundaries respectively. The specific comparison results are shown in Table 3.

Table 3 Comparison with state-of-the-art detection methods on the validation dataset

Intuitively, the detection effect of the original Mask R-CNN outperforms the other advanced detectors with the same backbone extraction capability on the test set. It achieves 7.4%, 1.1% and 3.9% APbox gains compared with SSD (Liu et al., 2016a), Faster R-CNN and RetinaNet (Lin et al., 2020) respectively. By adding the attention module to the naive Mask R-CNN, the improved Mask R-CNN obtains further gains of 1.1%, 1.6% and 3.8% in terms of APbox, \(AP_{{50}}^{{box}}\) and \(AP_{{90}}^{{box}}\). From the above analysis, RS-Net achieves a better detection effect and is therefore more suitable and robust for deployment on the vision system of apple harvesting robots.

Segmentation effect

Since the aim of this paper is to explore the ability of the model to segment overlapped fruits against a background of the same color, in-depth comparative experiments with state-of-the-art instance segmentation methods were carried out and their results analyzed to validate the effectiveness of RS-Net. The specific comparison results are shown in Table 4.

Table 4 Comparison with state-of-the-art instance segmentation methods on the validation dataset

All models employed ResNet50 as the feature extractor for a fair comparison. RS-Net achieves the best results in terms of both box AP and mask AP metrics. In particular, compared with RetinaMask (Fu CY et al., 2019), which has a similar architecture (detector + mask branch), RS-Net achieves a 2.8% APbox gain and a 2.6% APmask gain respectively. It should also be noted that the gap between APmask, \(AP_{{50}}^{{mask}}\) and \(AP_{{90}}^{{mask}}\) is smaller than for YOLACT (Bolya et al., 2019), YOLACT++ (Bolya et al., 2020) and RetinaMask, which means that most masks segmented by RS-Net are concentrated in the high-quality range (higher IoU with ground truth). To show the effect of RS-Net more intuitively, several representative images containing different numbers of apples were selected and segmented by the different methods. The visualization results are shown in Fig. 8. The effect of the proposed method is clearly better than that of the other methods in terms of both recognition accuracy and segmentation quality. In addition, since RS-Net introduces the attention module into the architecture, most heavily overlapped apples are also well segmented, including some that are not even labeled as ground truth.

Fig. 8

Visualization results of different test images segmented by RS-Net, RetinaMask, YOLACT and YOLACT++ respectively

Conclusion

In order to effectively detect overlapped fruits in a natural environment, the RS-Net architecture, which extends Mask R-CNN by adding an embedded Gaussian attention module, is proposed, thus making similar semantic features achieve mutual gains and reducing the impact of adverse factors such as occlusion, illumination and overlap. The experimental results show that the proposed RS-Net outperforms other state-of-the-art deep learning based methods when applied to the vision system of a harvesting robot, achieving both higher accuracy and stronger robustness, which makes it more suitable for operating in real-world scenes as a harvesting robot's vision system.

Although RS-Net has achieved relatively ideal experimental results, there is still room for improvement in the architecture. For example, the average segmentation time of each 512 \(\times\) 512 image on the GPU is 65.79 ms, while the fastest model in the experiments (YOLACT++) needs only 20.43 ms. The shortest inference times of other works in the fruit detection and segmentation field that also reported results at 512 × 512 resolution are 15 ms (Koirala et al., 2019b) and 390 ms (Li et al., 2021) respectively; in comparison, the inference time of RS-Net is in the lower middle. Although this already meets the real-time needs of practical deployment, it is still somewhat longer than that of other methods. This is suspected to be caused by two reasons: (1) the Faster R-CNN that Mask R-CNN extends is a two-stage architecture designed for better accuracy, which inevitably consumes more computation and time than one-stage methods; (2) RS-Net is anchor-based in order to achieve a higher recall rate, which requires the model to densely place anchor boxes on the original images and also increases the time consumption. Based on these limitations, extending the mask branch to one-stage or anchor-free detection methods will be considered in future work to strike a better trade-off between speed and accuracy.