
1 Introduction

In precision agriculture, plant phenotyping is an important task. It helps scientists and researchers collect valuable information about plant structure, which is a basic requirement for improving plant discrimination and plant selection [1]. 3D reconstruction of plants through phenotyping operations is helpful for evaluating plant growth and yield over time, allowing more extensive management of the plant [2]. These 3D reconstructed plant models can be used to describe leaf features, discriminate between weeds and crops, estimate plant biomass, and classify fruits. Conventionally, all of these attributes have been evaluated by experts on the basis of visual scores, which introduces inconsistency between expert judgements. In addition, this process is tedious.

The primary aim of plant phenotyping is to measure plant traits precisely and without subjective bias. Nonetheless, progress is still limited by the available processing and sensing technologies. Most modern sensing technologies are only two dimensional, e.g. thermal or hyperspectral imaging, and inferring 3D information from such sensors depends strongly on the distance and angle to the plants. In contrast, 3D reconstruction has been suggested for morphological classification of plants, and it is developing rapidly and attracting considerable attention. Structured light (Kinect sensor) [3], ToF cameras [4] and LiDAR [5] are active sensing techniques used for 3D reconstruction; they use their own source of illumination. However, these state-of-the-art systems are costly. On the other hand, image-based passive 3D reconstruction techniques, which use the radiation already present in the scene and include structure-from-motion [6], stereo vision [7] and space carving [8], need only one or two cameras, which results in a very cost-effective system.

ToF cameras perform well and appear to be suitable sensors for evaluating plants; they are commonly combined with an RGB camera. Kazmi et al. [4] analysed the performance of ToF cameras for close-range imaging under different illumination conditions. They found that ToF cameras deliver high frame rates as well as accurate depth data under suitable conditions. However, the resolution of the depth images is often low; the sensors are sensitive to ambient sunlight, which usually leads to poor performance outdoors; the quality of the depth values depends on the colour of the objects; and some sensors suffer from blurring when sensing moving objects. Because of these limitations, it is difficult to use ToF cameras for 3D reconstruction under outdoor conditions.

LiDAR is an extension of the principles used in radar. It estimates the distance between the target and the scanner by illuminating the target with a laser and measuring the time taken for the reflected light to return [9]. Kaminuma et al. [10] presented an application of a laser range finder for 3D reconstruction that represents the leaves as polygonal meshes and then measured morphological features from those models. Paulus et al. [11] determined that LiDAR is an appropriate sensor for obtaining precise 3D point clouds of plants, but it does not give any information on the surface area; in addition, it had poor resolution and a long warm-up time. LiDAR has given excellent results under outdoor conditions but has the drawback of being very costly. Other disadvantages are that the sensor needs calibration and that multiple captures are required to overcome occlusion. The sensor data also cannot resolve overlapping leaves efficiently, and the depth maps and images are not of high quality.

An alternative approach for depth estimation is the use of structured light. In this approach, the light source (either near-infrared or visible) is offset a known distance from an imaging device. The light from the emitter is reflected back into the camera by the target object, and knowledge of the light pattern allows the depth to be derived through triangulation [9]. Baumberg et al. [12] presented a 3D plant analysis based on a technique they called mesh processing; in this work, the authors built a 3D reconstructed model of a cotton plant using a Kinect sensor, which performed well under indoor conditions yet struggled outdoors. Chéné et al. [13] used a depth camera to segment plant leaves and reconstructed the plant in 3D.

As mentioned above, stereo vision and structure-from-motion use passive illumination, which allows these techniques to work efficiently under outdoor conditions. An off-the-shelf digital camera can be used to capture overlapping images, which are processed by a computer to estimate depth or a 3D reconstructed model. Stereo vision is comparatively cheaper than active sensing techniques and has provided excellent 3D reconstructed models. Nevertheless, the camera alignment and the spacing between the cameras must be precise. For example, the distance between the plant and the camera (which is related to the focal length of the camera), the overlap between images, and the rotation of the plant between images all have to be considered. Ivanov et al. [14] described maize plants under outdoor conditions by using images captured from various angles to characterise plant structure. Takizawa et al. [15] reconstructed a 3D model of a plant and derived plant height and shape information.

Structure-from-motion combines images captured from multiple camera positions to create a sparse 3D point cloud: the camera positions and a set of 3D points are estimated jointly, and from this set of points a dense point cloud can then be generated. Jay et al. [16] proposed a method which builds a 3D reconstructed model of a crop row to obtain plant structural parameters; the 3D model is acquired using structure-from-motion from colour images captured by translating a single camera along the row. Quan et al. [17, 18] proposed a semi-automatic method based on structure-from-motion for modelling plants for applications such as plant phenotyping and yield estimation, which performed well under outdoor conditions but is computationally expensive.

In summary, each sensing technique has its merits and demerits [9]. What is needed from current sensors and systems is to reduce the amount of manual extraction of phenotypic data. Their performance remains, to a lesser or greater extent, restricted by the dynamic morphological complexity of plants [19]. Currently there is no 3D system or method that satisfies all requirements, so one should be selected depending on the budget and the application. Moreover, plant structure is generally complex and includes a large amount of self-occlusion (leaves blocking one another). Hence, reconstructing plants in 3D in a non-invasive manner remains a serious challenge in phenotyping.

Fig. 1. Sample of the captured images with different view angles of a chilli plant

Aiming to provide a cost-effective solution to the above challenge, we present an image-based 3D reconstruction system for outdoor conditions. Our contributions include:

1. An easy and cost-effective system (using just a mobile phone camera)

2. Investigation of the effects of adverse outdoor scenarios and possible solutions (movement of plants because of wind, and changes in light conditions because of moving clouds while capturing the images)

3. A precise 3D model obtained from a limited number of images

The rest of the paper is organised as follows: Sect. 2 describes the method used in this paper together with step-by-step results, and Sect. 3 discusses the effect of adverse outdoor scenarios on the 3D model along with possible solutions.

2 Materials and Methods

We selected a chilli plant (Capsicum annuum L.) in a commercial field (Palmerston North, New Zealand) for testing our image processing. This chilli plant was selected for its year-round demand and its high value. Images were acquired during December 2017, when plant height was between 15 cm and 20 cm. The crop was planted in rows 90 cm apart, and our experiment aimed at modelling individual plants; as a result, other plants did not interfere with the model and only one plant was monitored at a time.

2.1 Image Acquisition

The images were captured sequentially following a circular path around the plant axis. Seven different rounds were taken at various angles, heights and distances. At least 15 images were captured on each path by revolving around the plant with a mobile phone's rear camera (Apple iPhone 6s+ with 12 MP rear camera, f/2.2), capturing every 10\(^\circ \) to 15\(^\circ \) of the perimeter. The distance between the plant and the camera was not kept constant. These seven rounds produced 105 images with 95% overlap between successive images. Images were taken under outdoor conditions, giving us a variety of images to work with. The camera positions were chosen to ensure that the plant was entirely in the field of view and that the images were of good quality (not blurred, etc.). Structure-from-motion estimates the intrinsic camera parameters by itself, so the camera positions do not have to be calibrated during image acquisition. Samples of the captured images with different view angles of the chilli plant and the image acquisition scheme are shown in Figs. 1 and 2 respectively.

Fig. 2. Image acquisition scheme

2.2 Plant-Soil Segmentation

As the experiment is conducted under outdoor conditions, plant-soil segmentation has to be robust. This step distinguishes plant pixels from soil pixels, and because it is applied to every image it has to be autonomous. The improved vegetation index, excess green (ExG) [20], has been used, which is defined as:

$$\begin{aligned} ExG = \frac{2G-R-B}{R+G+B} \end{aligned}$$
(1)

where R, G, and B are the red, green, and blue pixel components respectively. Pixels associated with the plant class generally have high ExG values, which makes the discrimination between plant and soil easier. Figure 3 depicts the plant-soil segmentation of one of the views.
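To make this step concrete, the following is a minimal sketch of how such a segmentation could be implemented, assuming OpenCV and NumPy are available; the Otsu threshold and the file name are illustrative assumptions rather than the exact settings of our pipeline.

```python
import cv2
import numpy as np

def segment_plant(image_path):
    """Segment plant pixels from soil pixels using the excess green (ExG) index, Eq. (1)."""
    bgr = cv2.imread(image_path)                       # OpenCV loads images as BGR, uint8
    b, g, r = cv2.split(bgr.astype(np.float32))

    # ExG = (2G - R - B) / (R + G + B); plant pixels tend to have high values
    denom = r + g + b
    denom[denom == 0] = 1e-6                           # avoid division by zero on black pixels
    exg = (2.0 * g - r - b) / denom

    # Rescale to 8 bit and threshold with Otsu's method (illustrative choice)
    exg_u8 = cv2.normalize(exg, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(exg_u8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    plant_only = cv2.bitwise_and(bgr, bgr, mask=mask)  # keep only plant pixels
    return mask, plant_only

# Example usage (hypothetical file name):
# mask, plant = segment_plant("chilli_view_01.jpg")
```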

Fig. 3. Plant-soil segmentation based on improved vegetation index

2.3 Keypoint Detection and Matching

After segmenting plant from soil, the next task is to find common keypoints (features) between pairs of images. For this we used the scale-invariant feature transform (SIFT) [21]. Each image is converted into a large set of keypoint vectors, each of which is invariant to image scaling, rotation and translation. The standard steps in SIFT are:

1. Formation of a scale space: The first stage searches over all scales and image locations. It is implemented efficiently using a difference-of-Gaussian (DoG) function to identify potential keypoints that are invariant to scale and orientation.

2. Locating keypoints: At each candidate location, a detailed model is fit to determine scale and location. Keypoints are then selected based on their stability.

3. Assignment of orientation: Based on local image gradient directions, one or more orientations are assigned to each keypoint location. All subsequent operations are performed on image data that has been transformed relative to the assigned scale, location, and orientation of each keypoint, thereby providing invariance to these transformations.

4. Keypoint descriptor: The local image gradients are measured at the selected scale in the region around each keypoint. These gradients are transformed into a representation that allows for significant changes in illumination and local shape distortion. Figure 4 illustrates the keypoints detected in two images.

5. Matching of keypoints: Keypoints are matched between pairs of images of an object or scene captured from different viewpoints and angles. Matching is based on finding similar keypoint feature vectors between the two images; Fig. 5 shows the matching keypoints between two images, and a code sketch of detection and matching is given after this list. The matches are then filtered to remove outliers, and bundle adjustment is used to create a sparse 3D point cloud of the matched object or scene while simultaneously recovering the intrinsic and extrinsic camera parameters and the camera positions. The pyramid-like symbols in Fig. 6 represent the positions and angles of the camera, and the green dots represent the plant structure.
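The following is a minimal sketch of SIFT keypoint detection and ratio-test matching, assuming OpenCV's built-in SIFT implementation (cv2.SIFT_create); the 0.75 ratio follows Lowe's common recommendation, and the file names are illustrative.

```python
import cv2

def match_keypoints(img_path1, img_path2, ratio=0.75):
    """Detect SIFT keypoints in two views and return the filtered matches."""
    img1 = cv2.imread(img_path1, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(img_path2, cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)    # keypoints + 128-D descriptors
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Match each descriptor to its two nearest neighbours in the other image
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)

    # Lowe's ratio test discards ambiguous matches (likely outliers)
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    return kp1, kp2, good

# Example usage (hypothetical file names):
# kp1, kp2, matches = match_keypoints("view_01.jpg", "view_02.jpg")
```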

Fig. 4. Keypoint detection in an image pair

Fig. 5. Keypoint matching in an image pair

Fig. 6. Sparse 3D point cloud of the scene

2.4 3D Reconstruction

Finally, the estimated camera positions, parameters and orientations are used to create a dense 3D point cloud. We used a cross-correlation matching method: for a pair of overlapping images, a pixel in the first image is matched with the corresponding pixel in the second image along the epipolar line [7]. This process is iterated for each pair of images so that the calculated position of a given point is less noisy. The derived dense 3D point cloud is shown in Fig. 7; because of page limitations we show just two views of the resulting 3D model.
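For illustration, the sketch below shows the core idea of cross-correlation matching along an epipolar line under the simplifying assumption that the image pair has been rectified (so the epipolar line of a pixel is the corresponding image row); the window size and disparity range are illustrative, and this is not the full dense-matching implementation used in our pipeline.

```python
import numpy as np

def ncc(a, b):
    """Normalised cross-correlation between two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-9
    return (a * b).sum() / denom

def match_along_epipolar(left, right, row, col, half=5, max_disp=64):
    """Find the best match for left[row, col] along the same row of the right
    image (the epipolar line of a rectified pair), using NCC over a window."""
    patch = left[row - half:row + half + 1, col - half:col + half + 1].astype(np.float32)
    best_score, best_col = -1.0, None
    for d in range(max_disp):                     # search over candidate disparities
        c = col - d
        if c - half < 0:
            break
        cand = right[row - half:row + half + 1, c - half:c + half + 1].astype(np.float32)
        score = ncc(patch, cand)
        if score > best_score:
            best_score, best_col = score, c
    return best_col, best_score
```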

Fig. 7. 3D reconstructed model of chilli plant

2.5 Post Processing

The dense 3D point cloud is post-processed off-line in the open-source software MeshLab [22]. This software processes unstructured dense 3D models using filters and remeshing tools, which allows us to clean, smooth and manage the dense 3D model and to address the quantisation issue. Figure 8 shows the entire cleaned 3D model.
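Similar cleaning can also be scripted rather than done interactively; as an alternative illustration (not part of our MeshLab workflow), the sketch below removes sparse outlier points with the Open3D library, assuming the dense cloud has been exported, e.g. as a PLY file with a hypothetical name.

```python
import open3d as o3d

def clean_point_cloud(ply_path, nb_neighbors=20, std_ratio=2.0):
    """Remove sparse outlier points from a dense reconstruction."""
    pcd = o3d.io.read_point_cloud(ply_path)
    # Points whose mean distance to their neighbours is abnormally large are dropped
    cleaned, inlier_idx = pcd.remove_statistical_outlier(nb_neighbors=nb_neighbors,
                                                         std_ratio=std_ratio)
    return cleaned

# Example usage (hypothetical file name):
# cleaned = clean_point_cloud("chilli_dense.ply")
```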

Fig. 8. Clean 3D reconstructed model with overlapping leaves

2.6 Selection of Appropriate Number of Images

Deciding how many images are needed for plant 3D reconstruction is tricky, and it is an important factor. In general, a larger number of images gives additional information about the plant, but it also contains redundant data because of overlapping regions of the same scene and takes extra computation time to process. Moreover, we noticed during our experiment that a large number of images caused feature matching errors, which inevitably affects accuracy. In contrast, with too few images the output 3D model will lack necessary information about the plant. We found during our experimentation that it is quite difficult to reconstruct the plant in 3D using just 3-4 images, which cover only a limited range of viewpoints.

Based on this investigation, we tested hypotheses about the relationship between multi-view information capture and the quality of the rendered virtual view, in order to find an appropriate balance between multi-view capture and the quality of the 3D reconstructed model [23].

Fig. 9. Camera model

Figure 9 illustrates the camera model used in this experiment. \(Z_i\) is the distance between the plant and the camera. \(\omega \) is the arc between adjacent camera views on a circle of radius L, with a constant pitch \(\varDelta {x}\). \(f_{l}\) is the focal length of the camera, \(Z_{max}\) is the maximum depth of the plant and \(Z_{min}\) is the minimum depth of the plant.

Based on these assumptions and this model, an appropriate number of images for 3D reconstruction can be calculated using the following formulas:

$$\begin{aligned} \varDelta {x_{max}}= \frac{1}{f _{l}} \frac{Z _{max}}{Z _{min}} \end{aligned}$$
(2)
$$\begin{aligned} f_{nyq}= \frac{L}{2\varDelta {x_{max}}} \end{aligned}$$
(3)

where

$$\begin{aligned} L= \frac{2 \times Z _{max}Z _{min} }{Z _{max} + Z _{min} } \end{aligned}$$
(4)
$$\begin{aligned} N_{A}= \frac{\omega L }{\varDelta {x_{max}} } \end{aligned}$$
(5)

We selected 30 as an appropriate number of images for 3D reconstruction based on the above theory and formulas. Due to page limitations we do not present the step-by-step calculation, but as the theory is straightforward it is easy to estimate the appropriate number of images.
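As an illustration only, a small sketch that evaluates Eqs. (2)-(5) for given camera and scene parameters; the placeholder values in the usage comment are arbitrary and are not the measurements from our experiment.

```python
import math

def appropriate_image_count(z_max, z_min, f_l, omega):
    """Evaluate Eqs. (2)-(5) to estimate the number of images N_A."""
    dx_max = (1.0 / f_l) * (z_max / z_min)        # Eq. (2): maximum sampling interval
    L = 2.0 * z_max * z_min / (z_max + z_min)     # Eq. (4): harmonic mean of Z_max and Z_min
    f_nyq = L / (2.0 * dx_max)                    # Eq. (3): Nyquist view frequency
    n_a = omega * L / dx_max                      # Eq. (5): number of images on the arc
    return math.ceil(n_a), f_nyq

# Example usage with arbitrary placeholder values (NOT the values from our experiment):
# n_images, f_nyq = appropriate_image_count(z_max=0.9, z_min=0.5, f_l=4.2, omega=6.28)
```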

3 Discussion

The step-by-step results of the experimentation have been shown throughout the paper. It is difficult to quantify the quality of a 3D model, but as a rule of thumb its quality is a function of the input size relative to the realism it produces. To validate our result, a visual analysis of the 3D models we obtained (Fig. 8) is compared with the results presented in [24]; this comparison shows that our 3D models have better quality, as they are not missing details such as petioles, leaf surfaces and flower buds. There are different validation approaches in the literature. Several studies extracted 2D visual records and compared them with measurements obtained by manual phenotyping. Another approach is to use the various databases that have allowed researchers to assess the accuracy of their 3D models [8, 25].

In this experimentation, we captured numerous images of the chilli plant, with the number ranging from 5 to 100. We selected 30 as the appropriate number of images according to the theory presented and the quality of the resulting 3D model.

3.1 Effect of Adverse Outdoor Scenarios

Based on our literature survey and our outdoor experimentation, we found that some scenarios still cause problems and need more attention. Outdoor conditions can be windy, so we acquired another set of images in windy conditions, where the plants were moving. In another scenario we captured images while the light conditions were changing because of moving clouds. Here we investigate the effect of these outdoor scenarios on the resulting 3D models.

Fig. 10. Poor feature matching and 3D reconstruction due to heavy wind

Movement of Plant: In this scenario, we noticed that the displacement of the plant by the wind caused many feature matching errors, resulting in a poor 3D model. The resulting model was missing important details in the stem area of the plant and contained some half-reconstructed leaves (see Fig. 10). One possible solution is to detect the matches that are inconsistent between images because of the wind and to filter those images out of the database.
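One way such inconsistencies could be detected is to check how well each image pair fits a single epipolar geometry: a pair whose matches are mostly outliers to a RANSAC-estimated fundamental matrix (e.g. because the plant moved between exposures) is a candidate for removal. A minimal sketch, assuming OpenCV and the matches produced in Sect. 2.3; the inlier-ratio threshold is an assumption.

```python
import cv2
import numpy as np

def pair_is_consistent(kp1, kp2, matches, min_inlier_ratio=0.6):
    """Estimate a fundamental matrix with RANSAC and flag image pairs whose
    matches are dominated by outliers (e.g. because the plant moved in the wind)."""
    if len(matches) < 8:                               # need at least 8 correspondences
        return False
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None:
        return False
    inlier_ratio = float(inlier_mask.sum()) / len(matches)
    return inlier_ratio >= min_inlier_ratio            # threshold is illustrative
```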

Fig. 11. Poor feature matching and 3D reconstruction due to drastic change in illumination

Change in Illumination: In our study we tried to reduce the error caused by changes in illumination, with good results. However, in certain scenarios there can be a drastic change in illumination while the plant images are being captured. We observed that in this scenario the resulting 3D model was missing necessary information about the plant, such as the plant surface and leaves, resulting in blank patches in the 3D model, as shown in Fig. 11. One possible solution is to pre-process and normalise the acquired images first, to reduce the effect of illumination changes in the database.
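As one possible illustration of such pre-processing, the sketch below equalises the lightness channel of each image with CLAHE before keypoint detection, assuming OpenCV; the clip limit and tile size are illustrative defaults, not values tuned in our study.

```python
import cv2

def normalise_illumination(image_path):
    """Reduce illumination differences between views by equalising the
    lightness channel with CLAHE before keypoint detection."""
    bgr = cv2.imread(image_path)
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)

    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    l_eq = clahe.apply(l)                              # equalise only the lightness channel

    lab_eq = cv2.merge((l_eq, a, b))
    return cv2.cvtColor(lab_eq, cv2.COLOR_LAB2BGR)

# Example usage (hypothetical file name):
# normalised = normalise_illumination("cloudy_view_07.jpg")
```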

4 Conclusion

Plant phenotyping is achievable using our approach. The results of our experiments demonstrate that 3D reconstruction of a chilli plant is feasible on a low budget and can be used in different scenarios, even under outdoor conditions. Our contributions are: (1) an easy and cost-effective system operated under outdoor conditions that achieved good results; (2) an investigation of adverse outdoor scenarios and their effect on the 3D model; (3) selection of an appropriate number of images for reconstruction; (4) an entire 3D model from a limited number of images; and (5) automatic plant-soil segmentation. This 3D reconstruction system gives a cost-effective and efficient platform for non-invasive plant phenotyping, providing information such as fruit volume, leaf angles and leaf area index, which is important for assessing plant stress and growth.