Introduction

China is one of the largest fruit-producing countries in the world. After the grain and vegetable industries, the fruit industry has developed into the third largest agricultural plantation industry [1]. However, compared with developed countries, most fruit picking and post-harvest processing tasks still rely on costly and inefficient manual labor, which restricts the automation of the fruit industry and holds back post-production commercialization technology [2]. With accelerating population aging and urbanization, labor shortages will pose further difficulties and challenges to the sustainable development of the fruit industry. Information technology-supported precision agriculture offers a better solution to these problems: automated intelligent equipment and machinery can replace manual labor in tasks such as automatic fruit picking, accurate yield measurement in orchards, and fruit sorting, saving labor and material expenses. Fruit detection and recognition is a key technology for realizing vision systems in precision agriculture, as it provides category and location information for intelligent agricultural equipment.

Traditional fruit detection algorithms classify and detect fruit targets based on color, geometry, texture, and other features, and comprise three parts: region selection, feature extraction, and classification. These methods are relatively mature but are not robust to uneven illumination, fruit occlusion, or similarity between fruit and background colors, and they cannot meet the real-time requirements of application scenarios [3]. Since 2014, advances in deep learning, particularly convolutional neural networks (CNN), have significantly improved the state of the art in object detection [4]. Deep learning can self-learn and automatically extract feature information from images, offering stability and efficiency in fruit detection and recognition [5, 6]. Deep learning-based fruit detection algorithms fall into two types: region-based models and regression-based models. Region-based models consist of two stages, generating candidate regions and then extracting features from each candidate box for bounding box regression and classification, which leads to slower recognition; examples include Fast Region-based CNN (R-CNN) [7], Faster R-CNN [8], and Mask R-CNN [9]. Regression-based methods abandon the candidate-region stage and directly predict object category probabilities and positions, yielding a simpler network structure; examples include the You Only Look Once (YOLO) series [10,11,12] and the Single Shot MultiBox Detector (SSD) [13]. Although some accuracy is sacrificed, detection and recognition speed is improved.

Although deep learning-based object detection algorithms have superior detection performance, they require large numbers of parameters and high computational costs, making them difficult to deploy on edge devices with limited space and computing power. As a result, applying such algorithms to real-time detection and efficient sorting of fruits is challenging [14, 15]. Moreover, owing to the lack of public detection datasets for coarse-fine variety fruits, current models cannot detect both coarse and fine variety fruits with highly similar phenotypes.

To address the problems that existing fruit detection models are overly complex and cannot accurately detect coarse-fine variety fruits, this paper proposes a detection and recognition model for multiple coarse-fine variety fruits based on an improved YOLOv5. The model offers high robustness and accuracy with low complexity, and can be applied to real-time detection, online grading, and fast sorting of coarse-fine variety fruits. The main contributions are as follows:

(1) A fruit image dataset containing 20 different fruit varieties under clean and complex backgrounds was constructed to address the lack of public detection datasets for coarse-fine variety fruits, which can provide data support for the research, optimization, and application of detection models for coarse-fine variety fruits.

(2) To address the difficulty of deploying complex networks on edge devices with limited computing power, a lightweight network based on GhostConv and C3Ghostv1 was proposed by introducing depthwise convolution into the baseline detection network for coarse-fine variety fruits, greatly reducing the number of parameters and the computation of the model.

(3) To address the poor accuracy of existing networks for fine variety fruit detection in complex scenes, a feature extraction network based on C3Ghostv2 was proposed to obtain rich overall fruit features, and the bounding box loss function was optimized to improve the localization of ordinary-quality anchor boxes in complex scenes.

Related works

Traditional fruit detection and recognition methods typically extract manually designed features. First, features such as size, shape, color, and texture are extracted from the fruit image, and then a classifier is built by fusing one or more of these features to achieve automatic fruit classification and recognition. Liu et al. [16] developed a machine vision algorithm based on elliptical boundary models to recognize immature and ripe grapefruit on trees, converting images from RGB space to Y′CbCr space and fitting an implicit second-order polynomial elliptical boundary model in Cr–Cb color space using ordinary least squares (OLS). To achieve accurate detection of litchi fruits in natural environments, Yu et al. [17] used color and texture features to train a random forest binary classifier to identify litchi fruits and proposed a ripe-litchi identification method based on multiscale detection and non-maximum suppression to further improve detection accuracy. Pothen and Nuske [18] proposed a high-precision key point detection algorithm for round fruits such as grapes and apples, which determines candidate fruit locations from intensity variation and gradient direction on the fruit surface and uses a random forest classifier to identify fruit species. Traditional fruit detection and recognition algorithms are relatively mature, but their accuracy depends heavily on the extracted features and trained classifiers, resulting in low detection rates, slow detection speed, and poor applicability to targets in complex environments.

With improvements in computer performance, many deep learning-based object detection algorithms have been applied to fruit detection and recognition, yielding significantly better detection performance and speed. Prakash and Prakasam [19] proposed an intelligent fruit classification system based on a convolutional neural network with bilinear pooling of heterogeneous streams. Gao et al. [20] proposed a multi-class apple detection method based on an improved Faster R-CNN to address reduced picking efficiency caused by branch occlusion in orchards, reaching a mean average precision (mAP) of 87.9% with an average detection time of 0.241 s per image. Regression-based object detection algorithms offer faster detection and recognition, making them suitable for practical application scenarios, and many improvements on them have been proposed. Mirhaji et al. [21] trained and tested different versions of the YOLO model on image datasets of orange trees under different lighting conditions, adapted the models using a transfer learning strategy, and concluded that YOLO-V4 was the best model for orange detection. To address inaccurate cherry detection due to leaf occlusion, Gai et al. [22] proposed an improved YOLOv4 algorithm that combines DenseNet in the backbone to increase the density between network layers and enhance feature extraction, improving average precision by 0.15 over the original model. Wang et al. [23] proposed the DSE-YOLO algorithm for multi-stage strawberry detection, introducing DSE modules in the backbone to extract detailed and semantic features in the horizontal and vertical dimensions, achieving an mAP of 86.58% and an F1 score of 81.59%. Yao et al. [24] developed a YOLOv5-based defect detection model for kiwifruit that detects flaws accurately and quickly. To address poor accuracy in detecting small tomato targets, Wang et al. [25] proposed an improved small mobile network YOLOv5 (SM-YOLOv5) detection algorithm for tomato-picking robots in plant factories. Ma et al. [26] proposed DGCC-Fruit, a lightweight fruit recognition network based on YOLOv5 for detecting fine-grained fruits in different environments.

Deep learning-based fruit detection algorithms can learn features automatically from training data and exhibit strong fruit recognition capabilities in complex environments. However, their parameter counts and computational complexity are too high for deployment on edge devices for real-time fruit detection. Additionally, current deep learning-based fruit detection algorithms focus primarily on coarse-variety fruit datasets in ideal environments and generalize weakly, so they cannot accurately detect multi-object and fine-variety fruits in complex environments.

Materials and methods

Dataset

Samples in the experiments

Considering the wide cultivation area, variety diversity, and the botanical classification of fruit species and varieties, five coarse varieties (apples, cherries, watermelons, oranges, and pomelos) and their 20 fine varieties were selected as experimental samples; the variety information is shown in Table 1 and the Appendix. Specifically, seven cherry varieties were procured in Yantai and Yangling, China: black pearl, red light, huang mi, lapins, rainier, tieton, and pioneer. Samples of three apple varieties, ruiyang, ruixue, and aifei, were obtained at the Baishui Apple Experiment Station of Northwest Agriculture and Forestry University, China. Watermelon varieties included futian, dafugui, xinfunong, and chengyu; pomelo varieties included meizhou, liangping, and liangjiang. All samples were stored in a cooler at 0–5 \(^{\circ }\)C and 85% relative humidity for up to 5 days before the experiment to preserve freshness.

Dataset creation

Fig. 1 Dataset creation. a The original captured images. b The data-augmented images. c The image annotation process

Table 1 Variety of samples
Table 2 Camera parameters
Table 3 Details of the self-made dataset

JPG images of the 5 coarse variety fruits and their 20 fine variety fruits were collected against a simple background using a motorized turntable and an ordinary camera; the camera parameters are shown in Table 2. The acquisition process consisted of three steps:

(1) Fixed the camera position, placed the fruit samples in the center of the motorized turntable, and adjusted the initial pose of each sample so that the neck faced upward.

(2) Controlled the motorized turntable to rotate 360° clockwise, photographing the sample every 30° to capture it from multiple angles.

(3) Adjusted the sample pose so that the neck faced forward and repeated step (2); each sample thus yielded 24 pictures against a single background.

To improve the generalization and applicability of the model, images at multiple angles, of multiple varieties, and at different fruit densities were also captured in various complex scenes indoors and in the field. In total, as shown in Fig. 1a, 13,198 fruit images of 3024×4032 pixels were obtained and stored in JPG format after compression.

To address insufficient and unbalanced fruit image data, the dataset was expanded using data augmentation methods such as flipping, rotation, cropping, and brightness transformation, yielding a total of 23,198 images. The augmented images are shown in Fig. 1b.

As shown in Fig. 1c, the images were manually labeled using LabelImg. First, each target fruit in the image was marked with its smallest enclosing rectangle and its variety was recorded; a txt-format annotation file was then generated containing the fruit variety, the coordinates x and y of the rectangle's center point, and its width w and height h, all relative to the image size. Finally, the dataset was divided into training, validation, and test sets in the ratio 8:1:1; details are given in Table 3.
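This is the standard YOLO txt annotation layout (one object per line: class id plus normalized center coordinates and box size). A minimal parsing sketch follows; the file path, class id, and example values are hypothetical.

```python
from pathlib import Path

def load_labels(txt_path, img_w, img_h):
    """Read one YOLO-format txt file and convert each normalized
    (class, cx, cy, w, h) line to an absolute pixel box."""
    boxes = []
    for line in Path(txt_path).read_text().strip().splitlines():
        cls, cx, cy, w, h = line.split()
        cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
        x1 = (cx - w / 2) * img_w   # normalized center/size -> pixel corners
        y1 = (cy - h / 2) * img_h
        x2 = (cx + w / 2) * img_w
        y2 = (cy + h / 2) * img_h
        boxes.append((int(cls), x1, y1, x2, y2))
    return boxes

# Hypothetical label file for a 3024x4032 image; one annotated fruit might
# appear in labels/apple_0001.txt as: "4 0.512 0.430 0.210 0.185"
print(load_labels("labels/apple_0001.txt", 3024, 4032))
```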

Algorithmic optimization

Baseline network selection

As the detection algorithm with superior speed and accuracy among the YOLO family [27], YOLOv5 consists of four parts: input, backbone, neck, and head [28]. At the input, training images undergo data augmentation, adaptive anchor box computation, and adaptive image scaling before being fed into the backbone. The backbone mainly consists of CBS, C3, and SPPF modules and extracts feature maps at three scales. The C3 module performs feature extraction on the feature maps, reducing the repetition of gradient information during optimization and thus lowering computation while maintaining accuracy. The neck fuses the features output by the backbone: by combining a Feature Pyramid Network (FPN) with a Path Aggregation Network (PAN), it fully fuses high-level semantic features with low-level localization features. The head generates bounding boxes and predicts varieties, using the loss functions during training and Non-Maximum Suppression (NMS) at inference.

Depth_multiple is a scaling factor for the number of residual blocks in C3 and controls the depth of the network, while width_multiple scales the number of channels and controls its width. As shown in Table 4, YOLOv5 is divided into YOLOv5n, v5s, v5m, and v5l according to depth_multiple and width_multiple (a short sketch after Table 4 illustrates how these factors scale the network). YOLOv5n has the simplest structure, the smallest depth, and the fewest parameters, whereas YOLOv5l has the most complex structure, the greatest depth, and the most parameters. Complex networks can achieve better detection accuracy, but they require more parameters and computation and take longer to train and run. To detect coarse-fine variety fruits accurately and quickly, YOLOv5s was chosen as the baseline network, maintaining a balance between detection performance and model complexity.

Table 4 YOLOv5 network with different scaling factor
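For reference, the sketch below shows how the two factors scale a model, using the depth_multiple/width_multiple pairs published in the public YOLOv5 model configs; the rounding rules follow YOLOv5's parse_model, and the example stage (9 repeats, 512 channels) is illustrative.

```python
import math

# depth_multiple / width_multiple pairs from the public YOLOv5 model configs
SCALES = {
    "yolov5n": (0.33, 0.25),
    "yolov5s": (0.33, 0.50),
    "yolov5m": (0.67, 0.75),
    "yolov5l": (1.00, 1.00),
}

def make_divisible(x, divisor=8):
    # YOLOv5 rounds channel counts up to a multiple of 8
    return math.ceil(x / divisor) * divisor

def scale(n_repeats, channels, depth_multiple, width_multiple):
    """Scale a block's repeat count (depth) and channel width as in
    YOLOv5's parse_model."""
    n = max(round(n_repeats * depth_multiple), 1)
    c = make_divisible(channels * width_multiple, 8)
    return n, c

# Example: a C3 stage defined with 9 repeats and 512 channels in the base config
for name, (gd, gw) in SCALES.items():
    print(name, scale(9, 512, gd, gw))
# yolov5s -> (3, 256): one third the depth and half the width of yolov5l
```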

Algorithm improvement

Fig. 2 Proposed network based on YOLOv5s

Deep learning-based object detection algorithms demand substantial computational resources, which limits their application to real-time fruit detection and sorting in real operations under hardware constraints. To make fruit detection networks practical to deploy, this paper proposes a low-complexity, high-precision fruit detection network based on YOLOv5s. First, the lightweight module C3Ghostv1 was constructed by introducing the lightweight convolution GhostConv, and a lightweight network structure based on GhostConv and C3Ghostv1 was proposed to reduce model complexity. Second, the C3Ghostv2 module, which expands the input channels of the residual structure in C3Ghostv1, was introduced into the backbone so that depthwise convolution can extract rich overall target features at higher dimensions, improving the detection of phenotypically similar fruit varieties. Finally, the Wise-Intersection over Union (Wise-IoU) loss function, with its dynamic non-monotonic focusing mechanism, was introduced to improve the detection and generalization performance of the model for multi-target and fine-variety fruits in complex environments. The improved network structure is shown in Fig. 2.

(1) Network structure lightweighting

When a neural network extracts fruit features, it generates many highly similar feature maps; this redundancy usually ensures a comprehensive understanding of the input data and has an important impact on network performance [29]. Compared with depthwise convolution, the standard convolution used in YOLOv5 to generate these numerous similar, redundant feature maps requires more parameters and computation, making deployment on edge devices difficult. Therefore, a lightweight network based on GhostConv and C3Ghostv1 was proposed to generate the redundant fruit feature maps cheaply, reducing model complexity by introducing depthwise convolution.

Fig. 3 Structure of GhostConv module

GhostConv Module As shown in Fig. 3, GhostConv first uses \(1\times 1\) standard convolution to generate part of the intrinsic feature maps, then uses \(5\times 5\) depthwise convolution to generate the “ghost” feature maps of the intrinsic feature maps, and finally superimposes the intrinsic feature maps and the “ghost” feature maps on the channels to obtain the output feature maps.
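A minimal PyTorch sketch of this structure is given below, following the \(1\times 1\) primary convolution and \(5\times 5\) depthwise convolution described above; the BatchNorm/SiLU placement is an assumption rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Sketch of GhostConv: a 1x1 standard convolution produces the intrinsic
    half of the output channels, a cheap 5x5 depthwise convolution generates
    the "ghost" half, and the two halves are concatenated."""
    def __init__(self, c1, c2, act=True):
        super().__init__()
        c_ = c2 // 2  # intrinsic channels (half of the output)
        self.primary = nn.Sequential(
            nn.Conv2d(c1, c_, 1, 1, 0, bias=False),
            nn.BatchNorm2d(c_),
            nn.SiLU() if act else nn.Identity())
        self.cheap = nn.Sequential(
            # groups=c_ makes this a 5x5 depthwise convolution over the intrinsic maps
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),
            nn.BatchNorm2d(c_),
            nn.SiLU() if act else nn.Identity())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat((y, self.cheap(y)), dim=1)

# Quick parameter comparison against a standard convolution with the same
# output width; the gap approaches the ~2x compression of Eq. 6 when k ≈ d.
ghost = sum(p.numel() for p in GhostConv(128, 256).parameters())
std = sum(p.numel() for p in nn.Conv2d(128, 256, 1, bias=False).parameters())
print(ghost, std)
```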

Assuming an input feature map of size \(h\cdot w\cdot {{c}_{1}}\), a convolution kernel of size \(k\cdot k\), and an output of size \({h}'\cdot {w}'\cdot {{c}_{2}}\), the computational cost \({{S}_{1}}\) and parameter count \({{P}_{1}}\) of standard convolution are given by Eqs. 1 and 2, respectively.

$$\begin{aligned} {{S}_{1}}={{c}_{2}}\cdot {h}'\cdot {w}'\cdot {{c}_{1}}\cdot k\cdot k \end{aligned}$$
(1)
$$\begin{aligned} {{P}_{1}}={{c}_{2}}\cdot {{c}_{1}}\cdot k\cdot k \end{aligned}$$
(2)

In GhostConv, \(\frac{1}{2}{{c}_{2}}\) intrinsic feature maps are obtained by standard convolution, and the same number of “ghost” feature maps are obtained by depthwise convolution. Assuming a depthwise convolution kernel of size \(d\cdot d\), the computational cost \({{S}_{2}}\) and parameter count \({{P}_{2}}\) are given by Eqs. 3 and 4, respectively.

$$\begin{aligned} {{S}_{2}}=\frac{1}{2}{{c}_{2}}\cdot {h}'\cdot {w}'\cdot \left( {{c}_{1}}\cdot k\cdot k+d\cdot d \right) \end{aligned}$$
(3)
$$\begin{aligned} {{P}_{2}}=\frac{1}{2}{{c}_{2}}\cdot \left( {{c}_{1}}\cdot k\cdot k+d\cdot d \right) \end{aligned}$$
(4)

If \(k\approx d\), then according to Eqs. 1–4 the theoretical speedup ratio \({{r}_{s}}\) of GhostConv over standard convolution is given by Eq. 5, and the parameter compression ratio \({{r}_{p}}\) by Eq. 6.

$$\begin{aligned} {{r}_{s}}=\frac{{{S}_{1}}}{{{S}_{2}}}=\frac{{{c}_{1}}\cdot k\cdot k}{\frac{1}{2}\cdot {{c}_{1}}\cdot k\cdot k+\frac{1}{2}d\cdot d}\approx 2 \end{aligned}$$
(5)
$$\begin{aligned} {{r}_{p}}=\frac{{{P}_{1}}}{{{P}_{2}}}\approx 2 \end{aligned}$$
(6)

As shown in Eqs. 5 and 6, introducing GhostConv into the network can theoretically halve the computation and the number of parameters, facilitating model deployment for real-time fruit detection and fast sorting in real operations.

Fig. 4 Structure of C3Ghostv1 module

C3Ghostv1 Module As shown in Fig. 4, C3Ghostv1 uses two branches to process the input fruit feature maps in parallel. One branch uses standard convolution to halve the input feature channels and extract low-level fruit features. The other branch reduces the dimension of the input feature channels with a \(1\times 1\) standard convolution and then extracts high-level abstract fruit features through multiple GhostBottlenecks. As the basic residual unit, GhostBottleneck first processes the input feature map with two stacked GhostConv modules, where the first GhostConv omits the SiLU activation function to avoid the loss of fruit information caused by the nonlinearity [29]; the result is then added to the input feature map through a residual connection to obtain the output. By building on GhostConv, GhostBottleneck alleviates the gradient vanishing caused by deepening the network at a lower cost, helping the network extract more complex fruit features and identify fruits with small phenotypic differences. Finally, the results of the two branches are stacked along the channel dimension.

C3Ghostv1 preserves feature reuse through a hierarchical feature fusion strategy while avoiding excessive repetitive gradient information by truncating the gradient flow, thus preserving the model's ability to learn diverse fruit features while reducing the network's parameters and computation and accelerating training and inference.
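The sketch below shows one plausible PyTorch realization of GhostBottleneck and C3Ghostv1 consistent with this description, reusing the GhostConv class from the previous sketch; the branch widths and the final fusion convolution are assumptions.

```python
import torch
import torch.nn as nn
# Assumes the GhostConv class from the previous sketch is in scope.

def conv_bn_act(c1, c2, k=1):
    # Standard convolution + BatchNorm + SiLU (the CBS pattern in YOLOv5)
    return nn.Sequential(
        nn.Conv2d(c1, c2, k, 1, k // 2, bias=False),
        nn.BatchNorm2d(c2),
        nn.SiLU())

class GhostBottleneck(nn.Module):
    """Residual unit of two stacked GhostConvs. Per the text above, the first
    GhostConv omits the SiLU activation (the common GhostNet variant instead
    omits it on the second); either way the shortcut adds the input back."""
    def __init__(self, c):
        super().__init__()
        self.ghost = nn.Sequential(GhostConv(c, c, act=False),
                                   GhostConv(c, c, act=True))

    def forward(self, x):
        return x + self.ghost(x)

class C3Ghostv1(nn.Module):
    """CSP-style block: one 1x1 branch keeps shallow features, the other runs
    n GhostBottlenecks; the branches are concatenated and fused by a final
    1x1 convolution (the fusion convolution is an assumption)."""
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2  # each branch works at half the output width
        self.cv1 = conv_bn_act(c1, c_)   # input to the bottleneck branch
        self.cv2 = conv_bn_act(c1, c_)   # shallow-feature branch
        self.m = nn.Sequential(*(GhostBottleneck(c_) for _ in range(n)))
        self.cv3 = conv_bn_act(2 * c_, c2)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```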

(2) Enhancement of network feature extraction capability

Depthwise convolution processes each channel of the input separately, so it cannot capture the correlations among different channels at the same spatial position, and the extracted features are limited to the input feature dimensions [30]. Although the dimensionality reduction in C3Ghostv1 alleviates network complexity, the limited features that depthwise convolution extracts at a lower dimension make it difficult for the backbone to extract rich overall fruit features, which hampers the identification of highly similar fruits.

Fig. 5 Structure of C3Ghostv2 module

To improve the feature extraction capability of the network, the C3Ghostv2 module was designed for the backbone, inspired by MobileNetv2 [30], enabling depthwise convolution to capture rich overall features of different fruits at higher dimensions. As shown in Fig. 5, the module processes the input features with two parallel branches. One branch uses a \(1\times 1\) convolution to increase the channel dimension before the GhostBottleneck, doubling the number of channels; the other uses standard convolution to extract shallow features. Finally, a \(1\times 1\) convolution reduces the dimension of the concatenated feature maps from both branches, so that input and output have the same number of channels. By expanding the input channels of GhostBottleneck, C3Ghostv2 lets depthwise convolution extract more features with less feature loss, helping the network capture comprehensive information about different fruit varieties, thereby improving detection performance across varieties and reducing the false detection rate.
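A sketch of this expansion idea, reusing the helper classes from the previous sketches, is shown below; the expansion ratio of 2 follows the "doubling" described above, while the remaining details are assumptions.

```python
import torch
import torch.nn as nn
# Assumes GhostBottleneck and conv_bn_act from the previous sketch are in scope.

class C3Ghostv2(nn.Module):
    """Inverted-residual style variant of C3Ghostv1: the bottleneck branch is
    expanded (here 2x) before the GhostBottlenecks so the depthwise
    convolutions work at a higher dimension; a final 1x1 convolution projects
    back so that input and output widths match."""
    def __init__(self, c1, c2, n=1, expand=2):
        super().__init__()
        c_ = c2 // 2
        ce = c_ * expand                     # expanded bottleneck width
        self.cv1 = conv_bn_act(c1, ce)       # 1x1 expansion
        self.cv2 = conv_bn_act(c1, c_)       # shallow-feature branch
        self.m = nn.Sequential(*(GhostBottleneck(ce) for _ in range(n)))
        self.cv3 = conv_bn_act(ce + c_, c2)  # 1x1 projection/fusion

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```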

(3) Optimization of the bounding box regression loss function

YOLOv5 constructs a loss function as a weighted sum of bounding box regression (BBR) loss, classification loss, and objectness loss, where the BBR loss directly determines the localization performance of the model. YOLOv5 adopts the Complete-IoU [31] bounding box loss, which adds two penalty terms to the IoU loss [32]: the normalized distance and the aspect-ratio consistency between anchor and target boxes. However, it lacks a focusing mechanism for the accurate localization of ordinary-quality anchor boxes, so high-, ordinary-, and low-quality anchor boxes contribute equally to the loss, limiting detection performance for multi-object fruits in complex scenes. Therefore, the Wise-IoU [33] bounding box loss was introduced; its dynamic non-monotonic focusing mechanism reduces the competitiveness of high-quality anchor boxes and mitigates the harmful gradients generated by low-quality ones, effectively improving the model's detection performance and generalization for fruits in complex scenes.

Fig. 6 Schematic diagram of the anchor and target boxes. \(\left( x,y \right)\) is the center point of the anchor box and \(\left( {{x}_{gt}},{{y}_{gt}} \right)\) that of the target box. \({{W}_{i}}\) and \({{H}_{i}}\) are the width and height of the overlapping rectangular area of the two boxes; \({{W}_{g}}\) and \({{H}_{g}}\) are the width and height of their minimum enclosing box

As shown in Fig. 6, for the anchor box \(B=\left[ x\,y\,w\,h \right]\), x and y correspond to the center coordinates of the bounding box, and w and h represent the width and height of the bounding box. Similarly, \({{B}_{gt}}=\left[ {{x}_{gt}}\ {{y}_{gt}}\ {{w}_{gt}}\ {{h}_{gt}} \right]\) describes the properties of the target box.

First, Wise-IoU v1 constructs a two-layer attention-based bounding box loss as defined in Eq. 7.

$$\begin{aligned} {{{\mathcal {L}}}_{WIoUv1}}={{{\mathcal {R}}}_{WIoU}}{{{\mathcal {L}}}_{IoU}} \end{aligned}$$
(7)

where \({{{\mathcal {R}}}_{WIoU}}\in \left[ 1,e \right)\) is the penalty term for the distance between the center points of the anchor box and the target box, which significantly amplifies the IoU loss of ordinary-quality anchor boxes, as shown in Eq. 8, and \({{{\mathcal {L}}}_{IoU}}\in \left[ 0,1 \right]\) measures the overlap between the anchor box and the target box, which reduces the contribution of the penalty term for high-quality anchor boxes, as shown in Eq. 9. Loss optimization thus focuses on ordinary-quality anchor boxes, which benefits the localization of multi-target, hard-to-detect fruits under the interference of other objects in complex scenes.

$$\begin{aligned} {{{\mathcal {R}}}_{WIoU}}=\exp \left( \frac{{{\left( x-{{x}_{gt}} \right) }^{2}}+{{\left( y-{{y}_{gt}} \right) }^{2}}}{{{\left( W_{g}^{2}+H_{g}^{2} \right) }^{*}}} \right) \end{aligned}$$
(8)
$$\begin{aligned} {{{\mathcal {L}}}_{IoU}}=1-IoU=1-\frac{{{W}_{i}}{{H}_{i}}}{wh+{{w}_{gt}}{{h}_{gt}}-{{W}_{i}}{{H}_{i}}} \end{aligned}$$
(9)

Then, as shown in Eq. 10, Wise-IoU v3 utilizes outlier degree \(\beta\) to construct a non-monotonic focusing coefficient r, which is then applied to Wise-IoU v1.

$$\begin{aligned} \begin{aligned} {{{\mathcal {L}}}_{WIoUv3}}=r{{{\mathcal {L}}}_{WIoUv1}},r=\frac{\beta }{\delta {{\alpha }^{\beta -\delta }}} \\ \beta =\frac{{\mathcal {L}}_{IoU}^{*}}{\overline{{{{\mathcal {L}}}_{IoU}}}}\in \left[ 0,+\infty \right) \end{aligned} \end{aligned}$$
(10)

\(\alpha\) and \(\delta\) are hyperparameters, and the outlier degree \(\beta\) describes the quality of an anchor box. A small outlier degree indicates a high-quality anchor box, which is assigned a small gradient gain r, so that bounding box regression focuses on ordinary-quality anchor boxes, further enhancing fruit localization in complex scenes. Anchor boxes with a large outlier degree are likewise assigned a small gradient gain, which effectively prevents low-quality examples from generating large harmful gradients and improves the model's generalization performance.
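The sketch below assembles Eqs. 7–10 into a single PyTorch loss routine; the \(\alpha = 1.9\), \(\delta = 3\) defaults and the running-mean treatment of \(\overline{{{{\mathcal {L}}}_{IoU}}}\) follow the Wise-IoU paper, while the box layout and momentum value are assumptions.

```python
import torch

def wiou_v3_loss(pred, target, miou, alpha=1.9, delta=3.0, momentum=1e-2):
    """Wise-IoU v3 per Eqs. 7-10. `pred` and `target` are (N, 4) tensors of
    (x1, y1, x2, y2) boxes; `miou` is the running mean of the IoU loss kept
    by the caller (e.g. initialized to 1.0). Returns (loss, updated miou)."""
    # IoU loss, Eq. 9
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
    l_iou = 1 - inter / (area_p + area_t - inter + 1e-7)

    # Center-distance penalty R_WIoU, Eq. 8; the enclosing-box size is
    # detached from the graph, which is what the superscript * denotes
    c_p = (pred[:, :2] + pred[:, 2:]) / 2
    c_t = (target[:, :2] + target[:, 2:]) / 2
    enc = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    r_wiou = torch.exp(((c_p - c_t) ** 2).sum(1) / (enc ** 2).sum(1).detach())

    # Outlier degree and non-monotonic focusing coefficient, Eq. 10
    beta = l_iou.detach() / miou
    r = beta / (delta * alpha ** (beta - delta))
    miou = (1 - momentum) * miou + momentum * l_iou.detach().mean()
    return (r * r_wiou * l_iou).mean(), miou
```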

Experiments and discussion

Experimental setup

All experiments in this paper were run on Ubuntu 16.04 LTS. The CPU was an Intel Xeon Silver 4210 (2.20 GHz) with 64 GB of RAM; the GPU was a GeForce RTX 2080 Ti with 11 GB of VRAM, and the machine had 125 GB of memory. The training environment was based on the PyTorch deep learning framework with Python 3.8, and CUDA 10.2 with cuDNN 8.2.1 was used for GPU acceleration. To enable a fair comparison across all experimental configurations, the hyperparameters of the YOLO-based models were standardized: input images were resized to 640×640 pixels, the batch size was 32, and the models were trained for 150 epochs with an initial learning rate of 0.001 and a weight decay of 0.0005.

Evaluation metric

In this paper, precision, recall, and average precision (AP) were used as evaluation metrics for detection accuracy. Precision is the ratio of correctly predicted positive samples to all samples predicted as positive. Recall is the proportion of positive samples correctly identified out of all positive samples, measuring the model's ability to recognize positives. AP is the area under the precision–recall curve across detection thresholds; a higher AP indicates better detection performance. Mean average precision (mAP) is the average AP across object categories, measuring the model's detection performance over all categories. Precision, recall, AP, and mAP are computed using Eqs. 11–14:

$$\begin{aligned} \Pr =\frac{TP}{TP+FP} \end{aligned}$$
(11)
$$\begin{aligned} Rc=\frac{TP}{TP+FN} \end{aligned}$$
(12)
$$\begin{aligned} AP=\int \limits _{0}^{1}{\Pr \left( Rc \right) dRc} \end{aligned}$$
(13)
$$\begin{aligned} mAP=\frac{1}{c}\sum \limits _{i=1}^{c}{A{{P}_{i}}} \end{aligned}$$
(14)

where true positive (TP) denotes fruits correctly detected with an IoU above the 0.5 threshold, false positive (FP) denotes falsely detected fruits, and false negative (FN) denotes missed fruits.
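As an illustration of Eqs. 11–14, the sketch below computes precision, recall, and AP by numerically integrating a precision–recall curve; all numbers are hypothetical.

```python
import numpy as np

def average_precision(precisions, recalls):
    """Integrate the precision-recall curve (Eq. 13) with the standard
    all-point interpolation; inputs are per-threshold Pr/Rc arrays."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], recalls[order], [1.0]))
    p = np.concatenate(([1.0], precisions[order], [0.0]))
    # Make precision monotonically non-increasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# Toy counts for Eqs. 11-12: 8 TP, 2 FP, 2 FN -> Pr = 0.8, Rc = 0.8
tp, fp, fn = 8, 2, 2
print("Pr =", tp / (tp + fp), "Rc =", tp / (tp + fn))

# AP for a hypothetical PR curve; mAP (Eq. 14) averages AP over classes
print("AP =", average_precision(np.array([1.0, 0.9, 0.75, 0.6]),
                                np.array([0.2, 0.4, 0.6, 0.8])))
```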

In addition, the lightness of the network model was measured in terms of floating-point operations (FLOPs), the number of parameters, and the model size. FLOPs measure the amount of computation during the forward pass of the network and are used to evaluate the computational complexity of the model.
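One common way to obtain these numbers, sketched below, is a profiler such as the third-party thop package (an assumption; the paper does not state its tooling). Note that thop reports multiply–accumulate operations, which are conventionally doubled to FLOPs.

```python
import torch
from thop import profile  # third-party profiler: pip install thop

# Stand-in model; in practice this would be the detection network under test
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, 2, 1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(16, 32, 3, 2, 1))

dummy = torch.randn(1, 3, 640, 640)       # one 640x640 RGB input
macs, params = profile(model, inputs=(dummy,))
# thop counts multiply-accumulates; FLOPs are conventionally 2x MACs
print(f"{2 * macs / 1e9:.2f} GFLOPs, {params / 1e6:.2f} M parameters")
```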

Experimental results

Comparative analysis of different YOLOv5 models on self-made dataset

To select a suitable baseline model, YOLOv5 models of different depths and widths were trained on the self-made fruit dataset; the results are shown in Fig. 7. The training and validation loss curves show that the more complex models achieve lower bounding box regression, objectness, and classification losses on the fruit training set. However, the objectness loss of YOLOv5m and YOLOv5l rises on the validation set in the later stages of training, indicating that these models overfit. Additionally, the mAP curves show that the complex models do not deliver a significant improvement in fruit detection performance.

Fig. 7 Training results of different YOLOv5 models. a–c Bounding box regression loss, objectness loss, and classification loss on the training set. d–f The same losses on the validation set. g mAP at 0.5 IoU; h mean mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05

Model complexity is characterized by FLOPs, parameter count, and model size: the more complex the model, the larger all three. The detection performance and complexity of the models are shown in Table 5. YOLOv5l exhibits the best detection performance, achieving an mAP of 93.3% at 0.5 IoU (mAP@.5) and an mAP of 84.8% at IoU = 0.5:0.95 (mAP@.5:.95). However, it is also the most complex model, with a size of 93 MB, 46.21 M parameters, and 108 GFLOPs, which hinders deployment on edge devices for real-time fruit detection and online sorting. Conversely, although YOLOv5n has the most simplified architecture, its fruit detection performance lags well behind the other models. YOLOv5s has a model size of 14.5 MB, 7.06 M parameters, and 16.1 GFLOPs. Compared with YOLOv5l, its size is reduced by 84.41% and its computation by 85.10%, while detection performance drops only slightly: 0.3% in mAP@.5 and 1.8% in mAP@.5:.95. In summary, YOLOv5s is the most suitable for detecting multiple fruit varieties in clean and complex backgrounds, so it was used as the baseline for improvement in this paper.

Table 5 Detection performance and complexity parameters of different models of YOLOv5

Results of ablation experiments

The variation of each improved model's loss during training is shown in Fig. 8. Compared with the baseline, the model's loss on both the training and validation sets converged to a larger value after introducing C3Ghostv1, while introducing C3Ghostv2 in the backbone alleviated this increase. In addition, introducing the Wise-IoU loss function into the baseline not only accelerated convergence on the validation set but also reduced the loss, indicating its effectiveness in improving model performance. With C3Ghostv1, C3Ghostv2, and the Wise-IoU loss all introduced, YOLOv5s showed losses comparable to the baseline on the training set but superior performance on the validation set.

Fig. 8 Losses of different improved algorithms during training; the loss is the sum of box_loss, obj_loss, and cls_loss. a Loss on the training set. b Loss on the validation set

Fig. 9 Mean average precision of different improved algorithms during training. a mAP at 0.5 IoU (mAP@.5). b mAP at IoU = 0.5:0.95 (mAP@.5:.95)

The mean average precision of each improved model during training is shown in Fig. 9. The mAP@.5 and mAP@.5:.95 of the model decreased after introducing C3Ghostv1, recovered to slightly above the baseline after further introducing C3Ghostv2 in the backbone, and were further enhanced by introducing the Wise-IoU bounding box loss function.

The results of the ablation experiments on the test set are shown in Table 6. Compared with the baseline, after introducing C3Ghostv1 the precision and recall decreased by 0.3% and 0.4%, respectively, and both mAP@.5 and mAP@.5:.95 decreased by 0.4%, while the parameters decreased by 48%, the computation by 49%, and the model size by 46%. Thus the C3Ghostv1 module greatly reduces model complexity at a small cost in detection performance. After further introducing the C3Ghostv2 module in the backbone, precision and mAP@.5:.95 both improved by 0.3% over the baseline, 0.6% and 0.7% higher than with C3Ghostv1 alone, with only a slight increase in parameters and model size. This shows that expanding the input channels of the depthwise convolution enhances the network's feature extraction and improves detection performance without an explosive increase in model size and computation. After introducing Wise-IoU into the baseline, model complexity was unchanged while precision and recall both improved by 0.4%, and mAP@.5 and mAP@.5:.95 increased by 0.5% and 0.9%, respectively, confirming the effectiveness of the Wise-IoU loss for multi-object fruit detection under complex backgrounds. Introducing C3Ghostv1, C3Ghostv2, and the Wise-IoU loss together achieves the best balance between detection performance and model complexity: compared with the baseline, precision, mAP@.5, and mAP@.5:.95 increased by 0.5%, 0.6%, and 1.2%, respectively, while the parameters decreased by 32%, the computation by 1.2 GFLOPs, and the model size by 33%.

Table 6 Results of ablation experiments
Fig. 10 Detection in clean background

Fig. 11 Detection in complex scenes

Test image visualization

To assess the reliability of the proposed model, it was used to detect images from the test set. Figures 10 and 11 show detection results for several images in clean backgrounds and complex environments, respectively. The results show that the proposed model can accurately identify the coarse-fine variety while localizing the fruit, achieving fruit detection and recognition in a variety of environments.

Comparison of object detection algorithms on the self-made dataset

To verify the effectiveness of the proposed model, several object detection algorithms, including Faster R-CNN, Single Shot MultiBox Detector (SSD), YOLOv6, YOLOv7, and YOLOv8, were trained on the self-made dataset; the results are compared in Table 7. The proposed network delivers the best detection performance, with an mAP@.5 of 93.6% and an mAP@.5:.95 of 84.2%. It also has relatively low complexity, with a model size of only 9.9 MB and 4.71 M parameters, lower than most single-stage detectors and far lower than two-stage detectors such as Faster R-CNN. The results show that the proposed network achieves a reasonable balance between detection performance and model complexity.

Table 7 Comparison of object detection algorithms on the self-made dataset
Table 8 Comparison of object detection algorithms on public dataset

Comparison of the results of different object detection algorithms on public dataset

As a standard benchmark for image detection and classification, PASCAL VOC 2007 provides images of 20 classes in different contexts and contains training, validation, and test sets; the training and validation sets contain 5011 images and the test set contains 4952 images. To verify the generalization performance of the proposed model, it was compared with several other popular detection networks on VOC2007, as shown in Table 8. Each model was trained without pre-training weights to eliminate their influence on the results. The two-stage detector Faster R-CNN achieves the best detection performance with an mAP@.5 of 66.5%, but it is also the most complex network, with a model size of 297.83 MB, 30 times larger than the proposed network. Among single-stage detectors, the proposed algorithm performs best, with an mAP@.5 of 63.8% and an mAP@.5:.95 of 38.2%, and it retains an advantage in complexity with a model size of only 9.9 MB and 14.6 GFLOPs, achieving an optimal balance between detection performance and model complexity.

Conclusion

To address the excessive complexity of current fruit detection models and their inability to accurately detect fine variety fruits in complex scenes, this paper proposed a lightweight, high-precision fruit detection model based on the single-stage object detection network YOLOv5 and a self-made fruit dataset. The main findings are as follows: (1) Through image acquisition, manual annotation, dataset division, and augmentation, an object detection dataset containing 20 varieties of fruits in clean and complex backgrounds was constructed, addressing the current lack of public datasets for fine variety fruit detection. (2) By introducing depthwise convolution, a lightweight network structure based on GhostConv and C3Ghostv1 was proposed with 4.4 M parameters, a size of 9.9 MB, and 14.9 GFLOPs of computation, resolving the excessive complexity of existing networks and supporting model deployment on edge devices with limited space and computational resources. (3) By introducing the C3Ghostv2 module and the Wise-IoU loss function, the model reached an mAP@.5 of 93.6% and an mAP@.5:.95 of 84.2%, addressing the low accuracy of existing networks in detecting coarse-fine variety fruits in complex environments and satisfying the requirements of real-time detection, online grading, and fast sorting of many kinds of fruits in precision agriculture.

Although its generalization has been validated on the public PASCAL VOC 2007 dataset, the proposed model remains limited in complex outdoor fields and in the range of detectable fruit varieties. In the future, the dataset can be expanded with more fruit varieties and complex field scenarios to improve the model's generalizability.