Introduction

The automatic detection of fruit is of great importance for a wide range of applications in the agriculture field. Robots equipped with automatic fruit detection can detect and locate products to harvest automatically, which may significantly reduce labor tensions in countries such as China (Xiang et al. 2014; Zou et al. 2012, 2016; Wang et al. 2016). Automatic fruit detection may also assist in smart quality monitoring during the fruit growing process, as the appearance of the product is a valuable cue—such information can help farmers make useful decisions regarding fertilization or harvesting (Herold et al. 2005). It can also provide appropriate number of fruits in an orchard, which can be used to estimate yield as a necessary step to plan the harvest schedule (Font et al. 2014).

There currently exists a wide variety of color-based fruit detection methods, but challenges, such as low contrast between fruits and leaves, unstable illumination condition and occlusion, probably render them ineffective (Luo et al. 2016; Barnea et al. 2016). As the contour of an object is invariant to color and lighting conditions and remains stable to intra-category shape variation, contour is often employed for object detection (Murillo-Bracamontes et al. 2012; Lu and Sang 2015; Li et al. 2016). The aim of this study was to exploit contour rather than color information for automatic fruit detection. Figure 1 shows the contours marked in different colors. As can be seen, there are two issues to contour-based methods: (1) target fruit contours are often broken into inconsistent fragments which may be connected to non-target fragments. An example is given in Fig. 1c, where the edge fragment of each citrus either contains part of the contour of the leaf or shadow on the fruit surface. (2) A huge proportion of the input image is occupied by non-target fragments, which thus makes valid detection extremely difficult (see Fig. 1d). Inspired by Donoser et al. (2009) who presented partial shape matching (PSM) to identify parts of an edge fragment matching parts of a given reference contour from cluttered background, PSM is applied to detect sub-fragments of interest, then a novel probabilistic Hough transform (PHT) is investigated to aggregate these sub-fragments to obtain fruit candidates, and finally a support vector machine (SVM) classifier is built to exclude false candidates.

Fig. 1
figure 1

Illustration of the challenges in fruit detection. a Citrus image. b Bitter gourd image. c, d Contours obtained from a and b, respectively, and marked in different colors

Related work

Automatic fruit detection with color cues has been extensively studied over recent decades (Wang et al. 2013; Payne et al. 2014; Bargoti and Underwood 2017; Puttemans et al. 2016; Zhao et al. 2016). Interested readers are recommended to read recent reviews by Kapach et al. (2012) and Gongal et al. (2015). Here, the background material of sub-fragment detection and aggregation, and fruit classification is discussed.

Sub-fragment detection and aggregation

PSM is a commonly used, generalized algorithm (Donoser et al. 2009) to detect sub-fragments matching parts of a given reference contour. Riemenschneider et al. (2010) invented an angular descriptor to represent geometrical properties of the fragment and applied this descriptor to PSM to detect candidate sub-fragments, followed by employing PHT to group these sub-fragments into objects. Ma and Latecki (2011) first utilized lengths and orientations cues to increase the discrimination power of the descriptor, then used PSM to obtain sub-fragments of interest, and finally aggregated these sub-fragments into possible objects by finding maximum weight subgraphs in a graph using the graph clustering method proposed by Liu et al. (2010). Such a length-based descriptor is not invariant to scale change, thus limiting the application of this method. Fan et al. (2015) proposed a variation of PSM by breaking both the edge fragments and reference contour into shorter and continuous sub-fragments between any two neighboring curvature extreme points (CEP), then calculating the similarity between each pair of sub-fragments, and selecting the top 10% best matches as final results. PHT was used to merge these sub-fragments. Unfortunately, this method is not suited to circular fruits as no CEPs exist on their contours.

Generative model, which leverages a learned reference contour, is another popular method (Fergus et al. 2003; Opelt et al. 2006; Lu et al. 2009). Shotton et al. (2008) first learned parts of the reference contour using k-medoids from training samples, then detected them in the input image by applying the oriented chamfer matching algorithm (OCM), and finally used a cascaded classifier to combine fragments. Yu et al. (2017) also utilized OCM to locate sub-fragments of interest and formulated the sub-fragment aggregation problem as minimizing a Markov random field energy solved by the loopy belief propagation algorithm (Pearl 1982). However, fruit contours are too simple to be learned. To avoid complex learning, Su et al. (2015) directly divided a template into overlapped parts, maximized the posterior probability for the fragments in the image and the locations of object center given parts of a template to obtain sub-fragments of interest, and finally applied dynamic programming to assemble these sub-fragments. This method was effective, though the length and overlapping area of the parts must be determined carefully.

Alternative approaches utilize color or texture features of the target fruit to extract fragments of interest, and use circular fitting or circular Hough transform (CHT) to obtain fruit targets (Lu and Sang 2015; Linker 2017; Lu et al. 2018). Lu et al. (2018) applied CHT on hierarchical contours of an input image to detect citrus fruits. Linker et al. (2012) used a k-nearest neighbor (KNN) classifier to detect seed areas, segmented the contours of the seed areas into the arc and non-arc sub-fragments, then grouped adjacent arc sub-fragments into apple targets. This method was modified by Qureshi et al. (2016) to detect mango fruits. Circular models render these methods only applicable, naturally, to circular fruits.

In this research, a generalized algorithm that could deal with most type of fruits, such as circular, and non-circular, in the natural environment was investigated. PSM is first utilized to detect sub-fragments of interest. PHT, which has been well-established and successfully applied in previous studies (Maji and Malik 2009; Ommer and Malik 2009), is then used to assemble candidate sub-fragments.

Fruit classification

A classifier can be used to remove false positives after fragment aggregation. Several methods have been investigated to classify fruits, such as K-means clustering (Wachs et al. 2010), Bayesian classifier (Slaughter and Harrell 1989), KNN clustering (Kurtulmus et al. 2014), AdaBoost classifier (Luo et al. 2016), and SVM (Ji et al. 2012). Ji et al. (2012) had demonstrated the advantage of SVM in classification, therefore SVM is used in the final verification stage.

Materials and methods

Image acquisition

The datasets encompassed six categories of fruits: citrus, tomato, pumpkin, bitter gourd, towel gourd and mango, and comprised a total of 450 images (Table 1). Citrus images were gathered from an orchard in Guangzhou, China on December 2016, tomato, pumpkin, bitter gourd and towel gourd images were taken from the Nansha Base at the Guangzhou Academy of Agricultural Science on November 2017, and mango images were captured in an orchard located in South China Agricultural University, China, by using a consumer digital camera (DSC-W800, Sony Inc., Japan). All images were saved as JPG format at a resolution of 5184 × 3888 pixels. The imresize function with a bilinear interpolation method in MATLAB was performed to resize the images to 480 × 360 pixels to improve the computational efficiency. Ground-truth bounding boxes for each image were provided manually. A reference model for each category was also supplied (see Fig. 2). Examples of the datasets are shown in Fig. 3, and the acquisition scene of the datasets is shown in Fig. 4.

Table 1 The proposed datasets
Fig. 2
figure 2

Reference models. af are reference models of citrus, tomato, pumpkin, bitter gourd, towel gourd and mango, respectively

Fig. 3
figure 3

Examples of the datasets

Fig. 4
figure 4

Acquisition scene of our datasets

Algorithm overview

The flow chart of the proposed algorithm is shown in Fig. 5, which includes four steps. Edges of an input image are first detected by the gPb detector (Maire et al. 2008) and grouped to form fragments (Kovesi. 2000). Then, a novel shape descriptor is presented to measure the similarity between arbitrary pairs of fragments and applied to PSM for detecting candidate sub-fragments that are similar to parts of a given reference contour. Thirdly, a novel PHT is investigated to aggregate these sub-fragments, thus obtaining fruit candidates. Finally, an SVM classifier trained on color and texture features is used to remove false positives. More details are provided in the following subsection.

Fig. 5
figure 5

The proposed contour-based fruit detection method

It should be noted that as gPb outperforms most edge detectors, such as Canny, Prewitt, and Sobel, in terms of precision and recall (Arbelaez et al. 2011), gPb is used to extract edges from the input image.

Partial shape matching

The aim is to detect candidate sub-fragments that are similar to parts of a reference contour. A rotation and scale invariant shape descriptor is first presented to capture the geometric property of a given fragment and used to measure the similarity between any pair of fragments. This descriptor is then applied to PSM to detect candidate sub-fragments.

Shape descriptor

Let f denote an edge fragment with a sequence of edge pixels \(\varvec{f} = \left\{ {\varvec{p}_{1} ,\varvec{p}_{2} , \ldots ,\varvec{p}_{N} } \right\}\), where pi= (xi, yi)T is the i’th edge pixel on the fragment and N is the number of edge pixels on the fragment. Edge fragment f is smoothed by convolving with a Gaussian kernel with standard deviation σ (He and Yung 2008), which was set to 3 in the experiments. Next, the tangent vector ti and normal vector ni of each edge pixel pi are calculated as follows:

$$\left\{ {\begin{array}{*{20}c} {\varvec{t}_{i} = \left( {\Delta x_{i} ,\Delta y_{i} } \right)^{\text{T}} /\sqrt {\left( {\Delta y_{i} } \right)^{2} + \left( {\Delta x_{i} } \right)^{2} } } \\ {\varvec{n}_{i} = \left( { -\Delta y_{i} ,\Delta x_{i} } \right)^{\text{T}} /\sqrt {\left( {\Delta y_{i} } \right)^{2} + \left( {\Delta x_{i} } \right)^{2} } } \\ \end{array} } \right.$$
(1)

where ∆xi, ∆yi are the first-order partial derivatives in the x and y directions, respectively, and can be approximated by finite differences. Based on the tangent and normal vector cues, two N × N matrixes, distance matrix D(f) and angle matrix θ(f), are defined and used as descriptors to encode the geometric structure of fragment f. The element of the ith row and jth column of distance matrix D(f) is defined as the distance from edge pixel pi to the tangent vector of edge pixel pj (Fig. 6):

$$\varvec{D}\left( \varvec{f} \right)_{ij} = \frac{{\left| {\left( {\varvec{p}_{i} - \varvec{p}_{j} } \right)^{\text{T}} \varvec{n}_{j} } \right|}}{{\varvec{p}_{i} - \varvec{p}_{j2} }}$$
(2)

where i, j = 1,…, N. As depicted in Eq. (2), because the distance descriptor here utilizes the distance and normal vector properties of a fragment, it is intuitively more discriminative than the descriptors developed by Ma and Latecki (2011) and Fan et al. (2015) that only used the distance property.

Fig. 6
figure 6

Geometric structure of edge fragment

The element of the ith row and jth column of angle matrix θ(f) is defined as:

$$\varvec{\theta}\left( \varvec{f} \right)_{ij} = min\left\{ {\theta_{ij} ,\theta_{ji} } \right\}$$
(3)

where θij ∈ [0, 2π] refers to the relative angle at which tangent vector tj is rotated anticlockwise to be parallel to vector (pipj) (Tico and Kuosmanen 2003). Because θij considers the signed angle between two orientations, it is more robust than that of Ma and Latecki (2011) which directly uses the absolute angle of the orientations. Another advantage is that θ(f) is invariant to scale and rotation.

Given another edge fragment of length N, denoted as g, an N × N affinity matrix measuring the similarities between the corresponding pairs of elements of the two descriptors is drawn as follows (Ma and Latecki 2011):

$$\varvec{A}\left( {\varvec{f},\varvec{g}} \right) = exp\left( { - \frac{{\left( {\varvec{D}\left( \varvec{f} \right) - \varvec{D}\left( \varvec{g} \right)} \right)^{2} }}{{\sigma_{D}^{2} }}} \right) + { \exp }\left( { - \frac{{\left( {\varvec{\theta}\left( \varvec{f} \right) -\varvec{\theta}\left( \varvec{g} \right)} \right)^{2} }}{{\sigma_{\theta }^{2} }}} \right)$$
(4)

where σD and σθ are hyper-parameters, which not only play the role of normalization, but also can be used to balance the contributions of the distance and angle descriptors. These two values were set to 0.2 and π/6, respectively, in the experiments. The similarity between edge fragments f and g is defined as:

$$s\left( {\varvec{f},\varvec{g}} \right) = \frac{1}{{N^{2} }}\mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{j = 1}^{N} \varvec{A}\left( {\varvec{f},\varvec{g}} \right)_{ij}$$
(5)

Principle of partial shape matching

The purpose of PSM is to identify parts of a fragment matching parts of a given reference contour.

Let M = {q1, q2, …, qM} denote the edge pixel sequence of a given reference contour, where M is the number of edge pixels on this contour; let E = {e1, …, eK} be the edge fragments obtained from the input image, where ek denotes the k’th edge fragment of E with a sequence of edge pixels as \(\varvec{e}^{k} = \left\{ {\varvec{p}_{1}^{k} ,\varvec{p}_{2}^{k} , \ldots ,\varvec{p}_{{N_{k} }}^{k} } \right\}\), and Nk is the number of edge pixels on the k’th edge fragment.

To determine all sub-fragments of interest, let \(\varvec{e}^{k} \left( {e,l} \right) = \left\{ {\varvec{p}_{e}^{k} , \ldots ,\varvec{p}_{e + l}^{k} } \right\} \subseteq \varvec{e}^{k}\) denote a sub-fragment of ek, where l is the length. Similarly, \(\varvec{M}\left( {m,l} \right) = \left\{ {\varvec{q}_{m} , \ldots ,\varvec{q}_{m + l} } \right\} \subseteq \varvec{M}\) denotes a sub-fragment of M. The similarity between ek(e, l) and M(m, l) can then be calculated by using Eq. (5), which yields a 3D matrix defined by Ma and Latecki (2011):

$$\varvec{\varGamma}_{1}^{k} \left( {e,m,l} \right) = s\left( {\varvec{e}^{k} \left( {e,l} \right),\varvec{M}\left( {m,l} \right)} \right)$$
(6)

Note that the edge-linking algorithm (Kovesi. 2000) arranged edge pixels into fragments from either clockwise or counterclockwise directions. Consequently, if the order of a candidate sub-fragment is different from that of the given reference contour, there may be a low similarity score in the 3D matrix. Therefore, another 3D matrix is established to resolve this problem:

$$\varvec{\varGamma}_{2}^{k} \left( {e,m,l} \right) = s\left( {\varvec{e}^{k} \left( {e, - l} \right),\varvec{M}\left( {m,l} \right)} \right)$$
(7)

These two matrixes are computed in an efficient manner using integral image (Donoser et al. 2009), which only took a few milliseconds per edge fragment in the experiments. Because triplets {e, m, l} with low similarity scores or short lengths may be related to non-target fragments, \(\varvec{\varGamma}_{1}^{k} \left( {e,m,l} \right)\) and \(\varvec{\varGamma}_{2}^{k} \left( {e,m,l} \right)\) are set to zero if their scores or lengths are below a user-defined threshold.

Any tuple {e, m} with length l, which has a large similarity value, is highly possible to belong to a valid match. Therefore, the above 3D matrices are compressed into 2D matrices along length l to maximize the similarity:

$$\left\{ {\begin{array}{*{20}c} {L_{1}^{k} \left( {e,m} \right) = \mathop {\text{argmax}}\limits_{l}\varvec{\varGamma}_{1}^{k} \left( {e,m,l} \right)} \\ {L_{2}^{k} \left( {e,m} \right) = \mathop {\text{argmax}}\limits_{l}\varvec{\varGamma}_{2}^{k} \left( {e,m,l} \right)} \\ \end{array} } \right.$$
(8)

where the value of \(L_{1}^{k} \left( {e,m} \right)\varvec{ }\) or \(L_{2}^{k} \left( {e,m} \right)\) is the length l. Based on Eq. (8), a function assigning a unique length for tuple {e, m} is defined as:

$$L^{k} \left( {e,m} \right) = \left\{ {\begin{array}{*{20}c} {L_{1}^{k} \left( {e,m} \right), if \varvec{\varGamma}_{1}^{k} \left( {e,m,L_{1}^{k} \left( {e,m} \right)} \right) \ge\varvec{\varGamma}_{2}^{k} \left( {e,m,L_{2}^{k} \left( {e,m} \right)} \right) } \\ { - L_{2}^{k} \left( {e,m} \right), else} \\ \end{array} } \right.$$
(9)

where Lk(e, m) is a positive value when the order of sub-fragment \(\varvec{e}^{k} \left( {e,l} \right)\) is the same as the order of \(\varvec{M}\left( {e,l} \right)\), otherwise it is a negative value. Function Lk can be visualized by a map, as shown in Fig. 7. This map reveals a few isolated regions in Lk, each of which contains a certain number of matches that have similar start points but different lengths. A match with greater length can average out irregularities, so the extremum l as well as its location (e, m) from each region is detected as a candidate match recorded in a list denoted as R = {{e1, m1, l1}, {e2, m2, l2},…}. Another list I with the same length as R is used to record the index of the edge fragment on which a candidate sub-fragment is located. An example of sub-fragments detected by PSM is given in Fig. 8.

Fig. 7
figure 7

Visualization of function Lk

Fig. 8
figure 8

Example illustrating candidate sub-fragments of Fig. 1c detected by PSM. Each sub-fragment is marked in random color

Probabilistic Hough transform

As candidate sub-fragments detected by PSM are scattered in the input image, and some false candidate sub-fragments may exist, a reliable fragment aggregation method to form fruit candidates is required. In this study, a new PHT was developed. Unlike traditional PHT that needs to generate codebooks from training samples (Maji and Malik 2009; Chakraborty et al. 2013), the developed PHT casts probabilistic votes directly for the locations of possible fruits based on all the observations, i.e., candidate matches.

Let S(x, O) denote the score of fruit O at location x. Assume that the candidate matches are independent, then S(x, O) can be obtained by accumulating the probabilities p(x, O,{ek, mk, lk}):

$$\begin{aligned} S\left( {\varvec{x},O} \right) & = \mathop \sum \limits_{k} p\left( {\varvec{x},O,\left\{ {e_{k} ,m_{k} ,l_{k} } \right\}} \right) = \mathop \sum \limits_{k} p\left( {\varvec{x},O,\varvec{e}^{{I\left\{ k \right\}}} \left( {e_{k} ,l_{k} } \right),\varvec{M}\left( {m_{k} ,\left| {l_{k} } \right|} \right)} \right) \\ & = \mathop \sum \limits_{k} p\left( {\varvec{e}^{{I\left\{ k \right\}}} \left( {e_{k} ,l_{k} } \right),\varvec{M}\left( {m_{k} ,\left| {l_{k} } \right|} \right)} \right) \times p\left( {\varvec{x},O|\varvec{e}^{{I\left\{ k \right\}}} \left( {e_{k} ,l_{k} } \right),\varvec{M}\left( {m_{k} ,\left| {l_{k} } \right|} \right)} \right) \\ \end{aligned}$$
(10)

As done by Maji and Malik (2009), the prior probability p(eI{k}(ek, lk), M(mk, |lk|)) is assumed to be uniformly distributed, meaning that \(p\left( {\varvec{e}^{{I\{ k\} }} \left( {e_{k} ,l_{k} } \right),\varvec{M}\left( {m_{k} ,|l_{k} |} \right)} \right) = \frac{1}{\left| I \right|}\), with \(\left| I \right|\) the number of candidate matches. In this way, S(x, O) can be reduced to:

$$\begin{aligned} S\left( {\varvec{x},O} \right) & \propto \mathop \sum \limits_{k} p\left( {\varvec{x},O|\varvec{e}^{{I\left\{ k \right\}}} \left( {e_{k} ,l_{k} } \right),\varvec{M}\left( {m_{k} ,\left| {l_{k} } \right|} \right)} \right) \\ & = \mathop \sum \limits_{k} p\left( {\varvec{x}|O,\varvec{e}^{{I\left\{ k \right\}}} \left( {e_{k} ,l_{k} } \right),\varvec{M}\left( {m_{k} ,\left| {l_{k} } \right|} \right)} \right) \times p\left( {O|\varvec{e}^{{I\left\{ k \right\}}} \left( {e_{k} ,l_{k} } \right),\varvec{M}\left( {m_{k} ,\left| {l_{k} } \right|} \right)} \right) \\ \end{aligned}$$
(11)

where p(O| eI{k}(ek, lk), M(mk, |lk|)) refers to the probability that {eI{k}(ek, lk), M(mk, |lk|)} is a true match. In the experiments, p(O| eI{k}(ek, lk), M(mk, |lk|)) was set to 1 as every candidate match {eI{k}(ek, lk), M(mk, |lk|)} was obtained quite confidently by PSM (It would be possible to let p(O| eI{k}(ek, lk), M(mk, |lk|)) reflect the confidence of a candidate match according to its length, such as p(eI{k}(ek, lk), M(mk, |lk|)) = \(\frac{{\left| {l_{k} } \right|}}{{\mathop \sum \nolimits_{j} \left| {l_{j} } \right|}}\), as a longer candidate is more likely to belong to a valid match). Thus:

$$S\left( {\varvec{x},O} \right) \propto \mathop \sum \limits_{k} p\left( {\varvec{x}|O,\varvec{e}^{{I\left\{ k \right\}}} \left( {e_{k} ,l_{k} } \right),\varvec{M}\left( {m_{k} ,\left| {l_{k} } \right|} \right)} \right)$$
(12)

where p(x|O, eI{k}(ek, lk), M(mk, |lk|)) is the probability of a fruit center location, which can be estimated by accumulating the dissimilarity between the distance from each point on an observed sub-fragment to location x and that from the corresponding point on a reference contour to the center of the reference contour. Due to the fact that the point-to-line distance is more robust than Euclidean distance (Low 2004), the following point-to-line based dissimilarity metric is used:

$$d_{k,\left| j \right|} = \log \left( {1 + \left| {\left( {\varvec{p}_{{e_{k} + j}}^{{I\left\{ k \right\}}} - \varvec{x}} \right)^{\text{T}} \varvec{n}_{{e_{k} + j}}^{{I\left\{ k \right\}}} } \right|} \right) - { \log }\left( {1 + \left| {\left( {\varvec{q}_{{m_{k} + \left| j \right|}} - \varvec{c}_{M} } \right)^{\text{T}} \varvec{n}_{{m_{k} + \left| j \right|}}^{M} } \right|} \right)$$
(13)

where dk,|j| refers to the dissimilarity between the |j|’th point pair of the k’th match; \(\varvec{p}_{{e_{k} + j}}^{{I\left\{ k \right\}}}\) and \(\varvec{n}_{{e_{k} + j}}^{{I\left\{ k \right\}}}\) are the location and normal vector of the (ek+ j)’th point on the I{k}’th edge fragment, respectively; \(\varvec{q}_{{m_{k} + \left| j \right|}}\) and \(\varvec{n}_{{m_{k} + \left| j \right|}}^{M}\) are the location and normal vector of the (mk+ |j|)’th point on the reference contour, respectively; cM is the center of the reference contour. Figure 9 shows the schematic diagram of the developed dissimilarity metric.

Fig. 9
figure 9

Schematic diagram of the developed point-to-line based dissimilarity metric, where (eI{k}(ek, lk), M(mk, |lk|)) refers to the k’th match

Based on Eq. (13), probability p(x|O, eI{k}(ek, lk), M(mk, |lk|)) is defined as follows:

$$p\left( {\varvec{x}|O,\varvec{e}^{{I\left\{ k \right\}}} \left( {e_{k} ,l_{k} } \right),\varvec{M}\left( {m_{k} ,\left| {l_{k} } \right|} \right)} \right) = { \exp }\left( { - \frac{{\varvec{d}_{k}^{\text{T}} \varvec{d}_{k} }}{{\left| {l_{k} } \right|\sigma^{2} }}} \right)$$
(14)

where \(\varvec{d}_{k} = \left( {d_{k,1} ,d_{k,2} , \ldots ,d_{{k,\left| {l_{k} } \right|}} } \right)^{\text{T}}\); \(\sigma\) is an important constant. If σ is too large, the resulting probabilities will be relatively large and similar, thus losing discrimination; if σ is too small, some true-positive matches will have low probabilities due to disturbance. In the experiments, \(\sigma\) was set to 0.1.

To minimize the computation time, the range of x for each match (eI{k}(ek, lk), M(mk, |lk|)) is limited to a window twice the size of the reference model and centered at the center of eI{k}(ek, lk). A map describing the probability distribution of the locations of fruit centers is obtained after the PHT process is completed (see Fig. 10a). This map is segmented by a given threshold, and a non-maximal suppression procedure is performed to remove repetitive candidates (see Fig. 10b).

Fig. 10
figure 10

Example illustrating the result of PHT. a Probabilistic map of Fig. 8, where the black circle refers to the center of the possible fruit; b possible fruit centers in the RGB image

Features extraction and classification

Because some false positives exist after PHT, a feature vector for each candidate is extracted and classified via an SVM classifier for removing false positives. Where there is low-contrast between fruit and leaves, such as bitter gourd, texture features are more discriminative than color features. Where there is high-contrast with leaves, such as citrus, color features are preferable. In this context, a color and texture-based descriptor, named color-HOG, is proposed to describe all kinds of fruits. The color component of color-HOG comprises the mean and covariance of the HSV values in a window centered at a possible fruit center. As the covariance is a symmetric matrix, duplicate elements are not included in the color-HOG features. The texture component of color-HOG is histograms of oriented gradients (HOG) (Dalal and Triggs 2005), of which each cell is set to \(8 \times 8\) pixels and each block is \(16 \times 16\) pixels.

As a comparison, Linker et al. (2012) proposed a similar color and texture-based descriptor comprised of the statistics of RGB values in a square window centered at the pixel of interest. This descriptor is sensitive to illumination change, because varying illuminations would alter the RGB values of objects. The proposed descriptor utilizes the HOG features extracted from the gradient image which is invariant to lighting conditions, and thus relatively discriminative.

Results and discussion

Two quantitative experiments were performed, one for verifying the performance of the color-HOG descriptor, and one for examining the detection performance of the proposed algorithm. All codes were implemented using MATLAB (R2014a) on a computer with 4 GB RAM.

Evaluation of color-HOG descriptor

The color-HOG descriptor was evaluated by comparing two state-of-the-art descriptors, HOG (Dalal and Triggs 2005) and LBP (Ahonen et al. 2009). These three descriptors were utilized to train the SVM classifier, and precision and recall were used to evaluate the classification performance. Both positive and negative samples were obtained manually from one-third of the images of each dataset, among which 75% of samples were used as a training set, and the others were a test set. Note that 10-fold cross validation and a grid search method were utilized to optimize the parameters of SVM. Table 2 lists the experiment results of the three descriptors on the test set, which clearly showed that color-HOG outperformed HOG and LBP in terms of precision and recall.

Table 2 Precision and recall of the color-HOG, HOG and LBP descriptors

Evaluation of detection performance of the proposed algorithm

The performance of the proposed algorithm (named “PSMPHT”) was verified and compared with two methods, one was a similar contour-based detection method (Ma and Latecki 2011), and one was a modified version of the developed method by replacing PHT with a graph clustering method (Bulò and Pelillo 2013), named “PSMGC”. A detection was considered correct if the intersection of the detected and the ground-truth bounding box over their union was larger than 50% (Everingham et al. 2015). Precision and recall were used to assess the detection performance. Note that since one-third of the images per category were used to train the SVM, only the remaining images were subject to this experiment.

Tables 3 and 4 list the precision and recall of the three methods. PSMPHT had two best precisions and four best recalls, PSMGC obtained three best precisions and two best recalls, and Ma and Latecki’s method only had two best precisions, i.e., both PSMPHT and PSMGC obviously outperformed Ma and Latecki’s method. The main reason is that PSMPHT and PSMGC use the proposed bidirectional PSM, which searches the sub-fragments of interest along the input fragment from both the clockwise and anticlockwise directions, thus obtaining more confident candidate sub-fragments than the unidirectional PSM used by Ma and Latecki (2011). The developed shape descriptor encodes the tangent and normal properties of the fragment, which also contributes to the favorable performance of PSMPHT and PSMGC.

Table 3 Precision of PSMPHT, PSMGC, and Ma and Latecki’s method
Table 4 Recall of PSMPHT, PSMGC, and Ma and Latecki’s method

PSMPHT performed better than PSMGC in terms of the overall performance in this study. Because both PSMPHT and PSMGC use the proposed PSM to detect sub-fragments of interest, the performance difference between the two methods mainly lies in the aggregation method as-applied. As each match independently casts votes for the locations of fruit candidates, PHT can generate a multi-peak probabilistic map from which the local maxima are detected as fruit candidates. This characteristic allows PHT to obtain more confident candidates than GC and therefore enabled PSMPHT with larger precision and recall than GC. However, there was an exception. In the pumpkin category, as the pumpkin and its leaves have similar shapes, PHT would detect many false-positive candidates (Fig. 11a). This issue decreased the precision and recall of PSMPHT over the pumpkin category. A failure example is shown in Fig. 11. In general, PSMPHT has better overall performance than PSMGC in natural environments.

Fig. 11
figure 11

Failure example. a Shows possible locations of pumpkins detected by PHT. b Shows final results by SVM classifier

The mango dataset results showed that these three algorithms achieved high precision, as the nighttime environment inhibited the effects of illumination changes and thus improved the SVM classifier performance. Additionally, PSMPHT obtained a higher recall than the other algorithms. The main reason was that only one sub-fragment on the mango contour was detected as the mangoes appeared very small in the captured image, so GC could not find reliable maximum weight subgraphs from the constructed weighted graph, but PHT could still cast reliable votes for the fruit location. This result again confirmed that PSMPHT is highly effective.

Several detection results are shown in Fig. 12. These examples intuitively validated that PSMPHT is effective. Some advantages are presented: (a) the proposed algorithm could detect any type of fruits, such as green, orange, circular, non-circular, etc., i.e., it was a generalized algorithm, and (b) the proposed algorithm was robust under complex conditions, such as occlusion and illumination changes (see Fig. 12ai), low contrast between fruits and leaves (see Fig. 12d) and cluttered background. Also, a disadvantage was uncovered: the developed method would obtain overlapping detections (see Fig. 12aii) or miss some true positives (see Fig. 12eiii). A possible reason is that the proposed PHT uses a scale-variant dissimilarity metric to cast probabilistic votes, so PHT would generate multiple local maximums for large-sized fruits or is unable to detect fruits with large scale changes. Therefore, future research will focus on improving the scale invariance of PHT.

Fig. 12
figure 12

Example illustrating detection results of PSMPHT

Conclusions

This paper presented a framework for detecting a wide array of fruit types—green, orange, circular and non-circular were tested here—in a natural environment. Detection of sub-fragments of interest was performed using a bidirectional PSM technique, which traces the input fragment from two opposite directions to find possible sub-fragments matching parts of a given reference model. These sub-fragments of interest were grouped as fruit candidates using PHT. False positives were excluded by an SVM classifier.

The proposed algorithm was evaluated on image datasets captured in the natural environment. The datasets comprised citrus, tomato, pumpkin, bitter gourd, towel gourd and mango images, i.e., several different types of fruits. For citrus, tomato, pumpkin, bitter gourd, towel gourd and mango categories, the precision of the proposed algorithm was 0.783, 0.848, 0.745, 0.762, 0.807 and 0.919, respectively; the recall was 0.802, 0.582, 0.678, 0.791, 0.754 and 0.825, respectively. The results are comparable to a similar contour-based detection method. In other words, the proposed algorithm has competitive advantages.

PSM and PHT were used for sub-fragment detection and aggregation without necessitating the painstaking design of specific features for each type of fruit. This makes the proposed algorithm a generalized method. PHT utilizes a scale-variant dissimilarity metric to determine the probability value of a vote, so it may fail to detect fruits with large scale change. In the future, a scale-invariant PHT will be established.