1 Introduction

The agricultural industry is a vital sector of the global economy and plays a crucial role in the human food supply. Food production is constantly evolving, and technology has been a significant ally in this process. Vision-based systems and artificial intelligence have enabled significant improvements in quality and productivity. In this context, this work aims to develop a computer vision strategy for automatically estimating fruit maturation stages from a single photo acquired by handheld devices. This will allow for a more accurate and efficient evaluation of the fruit production process, reducing resource waste and increasing agricultural sector productivity.

While computer vision systems have provided valuable information for crop monitoring, the assessment of fruit ripeness has received limited attention. Existing systems primarily rely on drone-based monitoring methods, which do not allow for the determination of the specific maturation stage of each fruit. Alternatively, some researchers have developed systems to classify the maturation stage of fruit after harvesting, which may not be particularly useful for farmers.

Considering the importance of determining fruit ripeness for optimal harvesting decisions, we introduce a method for determining the maturation stage of fruits using a single photo acquired from a handheld device while the fruit is still on the tree. The proposed method relies on an image segmentation model to crop and align the fruit, which is subsequently analysed by a CNN to extract a visual descriptor of the fruit (illustrated in Fig. 1). The maturation stage is defined by a set of physicochemical parameters that are inferred from the visual descriptor using a regression model. To enable the learning of the image segmentation and regression models, we collected a dataset of 400 images of figs and prickly pears and their corresponding physicochemical parameters. To the best of our knowledge, this is the first dataset comprising both visual and physicochemical data, and we expect it to be of particular interest to the research community for studying the relationship between the chemical properties of fruits and their visual appearance. The dataset used in this work is publicly available at https://github.com/Diogo365/WildFruiP.

Our main contributions in this work are as follows:

  • We introduce a strategy for fruit ripeness estimation capable of operating on images acquired in the wild while the fruit is still on the tree.

  • We assess the performance of the proposed method in determining a set of physicochemical parameters of a fruit using a single image obtained in the visible light spectrum.

  • To foster the research on the problem of fruit ripeness estimation from visual data, we introduce a dataset comprising 400 images from two fruit species and their respective physicochemical parameters, which serve as a proxy to the fruit maturation stage.

2 Related Work

2.1 Detection Methods

Object detection in images is a crucial task in computer vision, which has made tremendous progress in recent years due to the emergence of deep learning.

Several works have taken advantage of this progress for fruit detection. In [12], Parvathi et al. proposed an enhanced model of Faster R-CNN [10] for detecting coconuts in images with complex backgrounds to determine their ripeness. The performance of the model was evaluated on a dataset containing real-time images and images from the Google search engine. The results showed that the improved Faster R-CNN model achieved better detection performance compared to other object detectors such as SSD [7], YOLO [9], and R-FCN [2].

2.2 Segmentation Methods

Image segmentation is crucial for fruit image analysis, as it allows separating fruits from other parts of the image, such as leaves or background. Mask R-CNN [4] is an instance segmentation method that has proven effective in object segmentation tasks and has been extensively used for fruit analysis applications.

Siricharoen et al. [13] proposed a three-phase deep learning approach to classify pineapple flavor based on visual appearance. First, a Mask R-CNN segmentation model was used to extract pineapple features from the YCbCr color space. Then, a residual neural network pre-trained on the COCO and ImageNet datasets was used for flavor classification. The authors concluded that their model successfully captured the correlation between pineapple visual appearance and flavor.

Ni et al. [8] developed an automated strategy for blueberry analysis. They employed a deep learning-based image segmentation method using the Mask R-CNN model to count blueberries and determine their ripeness. The results indicated variations among the cultivars, with ‘Star’ having the lowest blueberry count per cluster, ‘Farthing’ exhibiting less ripe fruits but compact clusters, and ‘Meadowlark’ showing looser clusters. The authors highlighted the need for objective methods to address the ripeness-estimation inconsistency caused by inconsistent annotations in the trained model.

2.3 Methods for Estimating Fruit Ripeness in Images

Several strategies have been introduced to enable pre-harvest, in-field assessment of fruit ripeness using handheld devices [6]. However, most approaches rely on the non-visible light spectrum, thus requiring dedicated hardware [11].

Regarding the approaches devised for the visible light spectrum, most rely on CNNs for the estimation of fruit ripeness. Appe et al. [1] proposed a model for tomato ripeness estimation using transfer learning. They relied on the VGG16 architecture, where the top layer was replaced with a multilayer perceptron (MLP). The proposed model with fine-tuning exhibited improved effectiveness in tomato ripeness detection and classification. In another work, Sabzi et al. [12] developed an innovative strategy for estimating the pH value of oranges from three different varieties. A neural network was combined with particle swarm optimization [5] to select the most discriminative features from a total of 452 features obtained directly from segmented orange images. This approach relied on a subset of only six features to obtain an accurate estimation of the pH values across the different orange varieties.

In short, few approaches were devised for addressing the problem of fruit ripeness estimation from visual data, ranging from traditional extraction of handcrafted features to deep-learning-based methods.

3 Proposed Method

The proposed approach can be broadly divided into three principal phases: the detection and segmentation of fruits in an image, the alignment and cropping of the fruit, and the determination of the physicochemical parameters of the fruit. The pipeline of this method is presented in Fig. 1.

3.1 Fruit Detection and Segmentation

This phase aims at removing spurious information from the image, keeping only the fruit region. Accordingly, the fruit is segmented automatically using the Mask R-CNN [4], which predicts a binary mask of the pixels belonging to a specific type of fruit. Considering the specificity of this task, the Mask R-CNN was fine-tuned on the proposed dataset, thus allowing it to generalize to the fruits targeted in this problem. To address the problem of multiple fruits in the image, we establish that the fruit to be analysed should be at the center of the image, and thus the remaining masks are discarded.
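The selection of the central mask can be implemented directly on top of the Mask R-CNN predictions. The following is a minimal sketch of this selection step, assuming a pre-trained torchvision model stands in for the fine-tuned variant described above; the 0.5 confidence threshold is also an assumption:

```python
# Sketch of central-mask selection; the weights and the 0.5 confidence
# threshold are assumptions, not values specified in the text.
import numpy as np
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def central_fruit_mask(image_tensor, score_thresh=0.5):
    """Return the binary mask whose centroid is closest to the image centre."""
    with torch.no_grad():
        pred = model([image_tensor])[0]
    h, w = image_tensor.shape[-2:]
    centre = np.array([w / 2.0, h / 2.0])
    best_mask, best_dist = None, float("inf")
    for score, mask in zip(pred["scores"], pred["masks"]):
        if score < score_thresh:
            continue
        m = (mask[0] > 0.5).cpu().numpy()       # soft mask -> binary mask
        ys, xs = np.nonzero(m)
        if len(xs) == 0:
            continue
        centroid = np.array([xs.mean(), ys.mean()])
        dist = np.linalg.norm(centroid - centre)
        if dist < best_dist:                    # keep the centre-most fruit
            best_mask, best_dist = m, dist
    return best_mask
```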

Figure 2 depicts the results obtained by applying Mask R-CNN to both figs and prickly pears.

3.2 Image Alignment

Considering that fruit orientation varies significantly in the images, it is particularly important to enforce a standard alignment to ease the learning of the fruit analysis model.

Considering the general shape of fruits, we propose to approximate their silhouette using an ellipse. The silhouette itself can be obtained from the segmentation mask produced in the previous phase.

Let M be the segmentation mask, and consider the general equation of the ellipse:

$$\begin{aligned} \frac{{((x - x_0) \cos \theta + (y - y_0) \sin \theta )^2}}{{a^2}} + \frac{{(-(x - x_0) \sin \theta + (y - y_0) \cos \theta )^2}}{{b^2}} = 1, \end{aligned}$$
(1)

where \((x_0, y_0)\) are the coordinates of the ellipse’s center, a and b are the horizontal and vertical semi-axes, respectively, and \(\theta \) is the ellipse orientation, with \(\theta \in \left[ -\frac{\pi }{2}, \frac{\pi }{2}\right] \). The boundary of M is determined using the convex hull of the (x, y) points of M, and least squares fitting [3] is used to determine \(x_0\), \(y_0\), a, b, and \(\theta \). The rotation angle \(\theta \) is then used to rotate the original image and crop the fruit region based on the minimum bounding box containing the fitted ellipse. The results of the fruit alignment can be observed in Fig. 3.
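One compact way to realize this step is with OpenCV, whose cv2.fitEllipse implements a least-squares ellipse fit; the sketch below is an assumed instantiation and may differ in detail from the fitting routine of [3]:

```python
# Sketch of the alignment step; cv2.fitEllipse is used as a stand-in for
# the least-squares ellipse fit over the convex hull of the mask.
import cv2
import numpy as np

def align_fruit(image, mask):
    """Rotate the image so the fitted ellipse is upright, then crop it."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    hull = cv2.convexHull(max(contours, key=cv2.contourArea))
    (x0, y0), (ax_w, ax_h), theta = cv2.fitEllipse(hull)  # theta in degrees

    # Rotate around the ellipse centre; the sign convention between
    # fitEllipse and getRotationMatrix2D should be verified in practice.
    h, w = image.shape[:2]
    R = cv2.getRotationMatrix2D((x0, y0), theta, 1.0)
    rotated = cv2.warpAffine(image, R, (w, h))

    # Crop the axis-aligned bounding box of the now-upright ellipse.
    x1, y1 = max(int(x0 - ax_w / 2), 0), max(int(y0 - ax_h / 2), 0)
    x2, y2 = min(int(x0 + ax_w / 2), w), min(int(y0 + ax_h / 2), h)
    return rotated[y1:y2, x1:x2]
```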

3.3 Determination of Physicochemical Parameters

In the third phase, a CNN model is used to learn a visual descriptor that encodes the discriminative information regarding the physicochemical parameters of the fruit. A multi-layer perceptron is used as a regression model to infer the nine physicochemical parameters from the visual descriptors. The CNN and the regression model were trained end-to-end using the mean-squared error loss, and k-fold cross-validation was adopted due to the reduced amount of training data.
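A minimal sketch of such a model is given below: a ResNet18 backbone with its classification layer removed (exposing the standard 512-dimensional pooled descriptor) followed by an MLP head; the hidden layer size of 256 is an assumption for illustration:

```python
# Sketch of the descriptor + regression model; the hidden size is assumed,
# and the 512-d descriptor is the standard ResNet18 pooled output.
import torch.nn as nn
import torchvision

class FruitRegressor(nn.Module):
    def __init__(self, n_params=9, hidden=256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="DEFAULT")
        backbone.fc = nn.Identity()          # expose the pooled descriptor
        self.backbone = backbone
        self.head = nn.Sequential(           # MLP regression head
            nn.Linear(512, hidden), nn.ReLU(),
            nn.Linear(hidden, n_params),
        )

    def forward(self, x):                    # x: (B, 3, H, W) image batch
        return self.head(self.backbone(x))

model = FruitRegressor()
criterion = nn.MSELoss()                     # trained end-to-end with MSE
```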

4 Dataset

Considering the unavailability of public datasets comprising fruit images and their corresponding maturation stage or physicochemical parameters, we acquired four photos of each of 60 figs and 40 prickly pears obtained from local farmers. The fruits were subsequently harvested and analysed in the lab to extract nine characteristics that are typically correlated with the maturation state of the fruit. The physical and chemical parameters obtained are listed in Table 1.

Table 1. Range of values for the physicochemical parameters used in this project.

To allow the development of a custom image segmentation model, we annotated the complete set of 400 images using the CVAT tool. An exemplar from each of the fruit species and its corresponding annotations can be observed in Fig. 4. To foster the research on the problem of estimating fruit ripeness from visual data, we make our dataset publicly available (see the link in Sect. 1).

5 Experiments

This section reports the performance of the proposed method for the problem of physicochemical parameter estimation from images of figs and prickly pears acquired using handheld devices. Tests are conducted on the aligned and misaligned/cropped datasets using different neural networks. Also, we compare the proposed approach with a state-of-the-art method devised for inferring fruit physicochemical parameters from visual data.

5.1 Implementation Details

Detection and Segmentation. The backbone of the Mask R-CNN was a Residual Neural Network (ResNet), specifically the ResNet50 variant available in the PyTorch framework. Training the model required prior annotations for each fruit, namely bounding boxes, labels, and masks. Data augmentation comprised resizing, horizontal flipping, and brightness and contrast adjustments. The annotated dataset was split into 80% for training and 20% for testing, and the model was trained for 50 epochs using the stochastic gradient descent optimizer with a learning rate of 0.001.
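A hedged sketch of this fine-tuning setup, following the standard torchvision recipe, is shown below; the class count, momentum value, and data loader are assumptions not specified in the text:

```python
# Sketch of Mask R-CNN fine-tuning; num_classes, momentum, and train_loader
# are assumptions (the text only specifies SGD, lr=0.001, and 50 epochs).
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 3  # background + fig + prickly pear (assumed)
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box and mask heads so they predict the fruit classes.
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes)
in_ch = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_ch, 256, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

model.train()
for epoch in range(50):
    for images, targets in train_loader:    # lists of tensors / target dicts
        loss_dict = model(images, targets)  # training mode returns losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```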

Determination of Physicochemical Parameters. The training data consisted of a set of images and their corresponding physicochemical parameters. Each parameter was normalized using a linear transformation estimated from the training data. A lightweight CNN architecture (ResNet18) was used to extract 512-dimensional visual descriptors from the aligned fruit images, and a multi-layer perceptron was used to estimate the nine parameters from these descriptors. The configurations used are presented in Table 2. All models were trained for a maximum of 100 epochs with early stopping as regularization. All experiments were conducted in PyTorch on an NVIDIA GeForce RTX 3060 GPU and an Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz; the inference times reported in Table 3 were obtained on this hardware configuration.
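The text only states that each parameter is normalized with a linear transformation estimated on the training data; a per-parameter min-max rescaling is one plausible instantiation, sketched below:

```python
# Min-max normalization sketch; "linear transformation" is all the text
# specifies, so this concrete form is an assumption.
import numpy as np

class LinearNormalizer:
    """Per-parameter linear rescaling fitted on the training split only."""

    def fit(self, y_train):                      # y_train: (n_samples, 9)
        self.lo = y_train.min(axis=0)
        self.hi = y_train.max(axis=0)
        return self

    def transform(self, y):                      # map to [0, 1] per parameter
        return (y - self.lo) / (self.hi - self.lo + 1e-8)

    def inverse_transform(self, y_norm):         # back to physical units
        return y_norm * (self.hi - self.lo + 1e-8) + self.lo
```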

Table 2. Configuration used for training the CNN.
Table 3. Inference time and total size of the different models.

5.2 Metrics

To assess the performance of the proposed model, four metrics were employed: mean squared error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the coefficient of determination (\(R^2\)). They are defined as follows:

$$\begin{aligned} MSE = \frac{1}{n} \sum _{i=1}^{n} (y_i - \hat{y}_i)^2, \end{aligned}$$
(2)
$$\begin{aligned} MAE = \frac{1}{n} \sum _{i=1}^{n} |y_i - \hat{y}_i|, \end{aligned}$$
(3)
$$\begin{aligned} MAPE = \frac{1}{n} \sum _{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i}\right| \times 100, \end{aligned}$$
(4)
$$\begin{aligned} R^2 = 1 - \frac{{\sum _{i=1}^{n}(y_i - \hat{y}_i)^2}}{{\sum _{i=1}^{n}(y_i - \bar{y})^2}}, \end{aligned}$$
(5)

In these equations, n represents the total number of observations in the evaluation set, \(y_i\) denotes the true value of the dependent variable for the \(i^{th}\) observation, \(\hat{y}_i\) represents the predicted value for the \(i^{th}\) observation, and \(\bar{y}\) denotes the mean value of the dependent variable across all observations.
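For reference, Eqs. (2)–(5) translate directly into a few lines of NumPy:

```python
# Direct NumPy implementations of Eqs. (2)-(5).
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):                      # assumes no true value is zero
    return np.mean(np.abs((y - y_hat) / y)) * 100

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot
```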

Table 4. Performance of the proposed approach. The \(R^2\) value (mean ± std) determined for both species denotes a strong predictive power for some physicochemical parameters. Also, the comparison with the approach of Sabzi et al. [12] shows a clear improvement in all parameters.

5.3 Performance of the Proposed Approach

The proposed method was assessed on the evaluation splits of both prickly pear and fig images using k-fold validation, repeating the training and evaluation process 10 times. The results are reported in Table 4.

The analysis of the results with respect to prickly pears shows a moderate correlation (refer to \(R^2\)) between some physicochemical parameters and the predictions of the network obtained from the fruit image. All parameters showed a positive correlation except for the length. Insufficient relevant information in the image might explain this lack of correlation: visual features such as shape, color, or texture are not informative about the length of a fruit. The strongest predictive power was obtained for the ‘a’ parameter, hardness, and the ‘b’ parameter, with correlations of 0.83, 0.51, and 0.42, respectively. The ‘a’ parameter represents the fruit chromaticity from green to red, which is strongly correlated with fruit ripeness. Hardness is also related to ripeness, as riper fruits are typically less firm. However, the sugar content, measured by TSS (°Brix), had a weak correlation, possibly due to the small size of the dataset.

Table 5. Results obtained by the proposed model with the misaligned and aligned datasets using prickly pears.

Regarding the performance attained on figs, only six out of nine parameters were evaluated due to insufficient data for TSS (°Brix), pH, and mass. Nevertheless, our approach demonstrated better accuracy in estimating the physicochemical parameters of this fruit species (figs), likely due to the disparity in dataset sizes.

Regarding the comparison with the state of the art, the method of Sabzi et al. [12] significantly underperformed our approach. The main reason for this difference is that the method of Sabzi et al. [12] was originally intended to analyse fruit images in controlled scenarios (it was devised for pH estimation of oranges against a uniform background), whereas images of fruits still on the tree are inherently more challenging due to varying pose, lighting, and background complexity.

5.4 Impact of Alignment Phase

In this experiment, the model was trained using misaligned/cropped and aligned images for the nine physicochemical parameters.

Prickly Pears. Upon analyzing Table 5, we observed that ablating the alignment of the images led, as expected, to worse results, turning positive \(R^2\) values negative, as in the case of the TSS (°Brix), pH, and L parameters.

Table 6. Results obtained by the proposed model with the misaligned and aligned datasets using figs.

Figs. Regarding figs, the analysis of Table 6 shows that the aligned dataset yielded slightly better results than the misaligned one. Alignment improved the results for the color parameters, while the misaligned dataset performed better for the shape features (diameter and length).

The diameter parameter proved challenging to estimate, although its results still surpassed those of the worst-performing parameter in the prickly pear experiment.

5.5 Impact of Model Architecture

Considering that the proposed approach is intended to run on handheld devices with low computational resources, the proposed method is based on a lightweight architecture. To determine the best architecture for the problem, we compared the impact of the architecture on the performance of the proposed approach, as well as on the inference time.

Therefore, for this experiment, we assessed the performance of our approach using two lightweight architectures: MobileNetV2 and ResNet18. The comparison of the model size and inference time of the different architectures is provided in Table 3, while Table 7 reports the performance of our approach across the different architectures.
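Comparisons of this kind can be reproduced with a simple profiling loop; the sketch below uses off-the-shelf torchvision models, an assumed 224×224 input, and CPU timing for illustration only (it is not intended to reproduce the numbers in Table 3):

```python
# Profiling sketch; input size, warm-up count, and run count are assumptions.
import time
import torch
import torchvision

def profile(model, n_runs=100, size=224):
    model.eval()
    x = torch.randn(1, 3, size, size)
    n_params = sum(p.numel() for p in model.parameters())
    with torch.no_grad():
        for _ in range(10):                       # warm-up runs
            model(x)
        t0 = time.perf_counter()
        for _ in range(n_runs):
            model(x)
    ms = (time.perf_counter() - t0) / n_runs * 1000
    return n_params, ms                           # parameter count, ms/image

for net in (torchvision.models.resnet18(), torchvision.models.mobilenet_v2()):
    print(type(net).__name__, profile(net))
```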

Table 7. Results obtained for the \(R^2\) metric on prickly pears and figs utilizing the ResNet18 and MobileNetV2.
Fig. 1. Pipeline of the proposed approach. The fruit image is given to an image segmentation approach, which determines the fruit mask. Using the mask, a fitting process is performed to enclose the fruit in an ellipse, and the rotation angle of the ellipse is used to align the fruit in the image. Afterwards, the fruit is cropped using the bounding box also extracted from the segmentation mask. The cropped fruit is fed into a CNN to extract a visual descriptor, which is subsequently mapped to a set of physicochemical parameters through a regression model.

It is interesting to observe that ResNet18 consistently attained the best results across all parameters and for both fruit species. Despite its larger size, we argue that its superior predictive power justifies its use in this problem. Also, it is important to note that the inference time is equivalent for both models.

5.6 Hard Samples

To further explain the obtained results, an additional test was conducted to identify the samples where the proposed method deviated most from the correct physicochemical parameters, using the MAE metric (less sensitive to outliers than MSE). Figure 5 shows the images of the two fruit species where the proposed approach had the largest MAE.
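Ranking evaluation samples by their per-sample MAE is straightforward; a minimal sketch is shown below (array names are assumptions):

```python
# Rank evaluation samples by mean absolute error over all parameters.
import numpy as np

def hardest_samples(y_true, y_pred, k=5):
    """y_true, y_pred: (n_samples, n_params) arrays of normalized values."""
    per_sample_mae = np.abs(y_true - y_pred).mean(axis=1)
    return np.argsort(per_sample_mae)[::-1][:k]   # indices of the k worst
```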

Fig. 2. Fruit detection and segmentation. The Mask R-CNN was fine-tuned to provide a rough segmentation of the fruit, allowing irrelevant regions of the image to be discarded from the analysis. Even though the masks are less accurate near the border, the accuracy of the segmentation mask is not crucial for the overall approach.

Fig. 3. Proposed alignment process. The fruit is approximated by an ellipse, which provides the rotation angle used to align the image and crop the aligned fruit. The alignment process is depicted for the two fruit species considered in this study.

Fig. 4. Samples from the proposed dataset. Our dataset comprises 400 images of two fruit species and their corresponding physicochemical parameters. We also provide the location of each fruit in the image through manually annotated bounding boxes and segmentation masks.

Fig. 5. Hard-to-predict samples. The five samples from the two fruit species with the highest absolute error over the nine physicochemical parameters.

Several factors affect the performance of the model, including luminosity differences, blur, variations in fig shapes (length, diameter, and mass), and limitations in training due to a lack of examples of unripe figs.

6 Conclusion and Future Work Prospects

In this work, we introduced an approach for estimating the maturation stage of fruits from images acquired in the wild using handheld devices. The proposed approach relied on an innovative alignment strategy that increased the robustness to pose variations. Also, we introduced a novel dataset containing images with significant variations in lighting and background. The experimental validation of the proposed approach showed a strong correlation with some physicochemical parameters, which can serve as a proxy to determine the maturation stage of the fruits considered in this study. Moreover, our approach remarkably surpassed a state-of-the-art approach specifically designed for fruit maturation estimation. To further validate the proposed method, we carried out several experiments, which showed that the alignment phase increased the performance of the method. Also, the analysis of the most challenging image samples evidenced that blur and brightness variation were the major causes of failure. In the future, we expect our approach to be incorporated into a mobile application, providing farmers with an easy-to-use fruit ripeness estimation tool for efficient control and informed decision-making in agriculture.