1 Introduction

Corn is a vital crop in North America that requires constant monitoring to ensure good production. Analyzing plant density is crucial in assessing crop consistency and providing valuable information for forecasting crop productivity [1]. However, to fully understand the consistency of crops across large fields, it is important to consider other parameters, such as inter-plant distances, percentage of crop coverage, and population of plants per row. This comprehensive analysis offers valuable insights for farmers and agronomists to implement variable rate technologies, enabling efficient management and control of crop growth [2].

The traditional method of assessing crop consistency involves manually counting crops and finding inter-plant distance on predefined sample areas in the field. This limits the view of the crops’ status and becomes laborious and expensive when applied on a large scale. Developing automated techniques for crop consistency analysis can overcome these problems and enable farmers to make informed decisions about crop establishment and early intervention, leading to optimized crop production and reduced wastage. These techniques can continuously monitor crops and analyze various characteristics, such as size, shape, and color. Several agricultural applications use computer vision (CV) techniques, such as convolutional neural networks (CNNs), which can count corn plants [3], classify leaf diseases [4, 5], segment leaves [6], and estimate weed density [7]. Deeper convolutional feature representations enhance the capability of these models to extract semantic information, resulting in better generalization in diverse crop and weed conditions [8, 9]. Various architectures have been proposed to obtain plant densities and estimate their distribution across a corn field [10,11,12].

Kitano et al. use the deep CNN architecture UNet to segment corn from the background and a blob detection methodology (based on CV) to quantify the number of plants in the image [10]. However, this method fails to provide precise locations for the detections. For on-ear corn kernel counting, Khaki et al. utilize a truncated VGG-16 for feature extraction in their proposed DeepCorn framework, reporting a mean absolute error (MAE) and root mean squared error (RMSE) of 41.36 and 60.27, respectively [13]. However, this method performs poorly on images from different fields and uses more trainable parameters (26.61M), making it inadequate for large-scale data handling. Mota et al. use CNNs for detecting and counting corn plants in the presence of weeds [14]. They base their research on the YOLOv4 [15] and YOLOv5 [16] architectures. Similarly, Guo et al. introduce a two-step machine learning-based image processing technique to detect and count sorghum heads using high-resolution images from unmanned aerial vehicles (UAVs) [17]. However, the robustness of these methods in different field conditions is undetermined, and assessing only plant stand counts is not sufficient to accurately evaluate the consistency of crops in large fields. Therefore, there is a need for an algorithm that considers multiple parameters to analyze crop consistency efficiently and can adapt to natural field conditions.

The identification of crop rows is a crucial factor in the analysis of crop consistency as it provides valuable data on crop emergence and inter-plant spacing. This information is used to monitor crop growth per row, which plays a significant role in identifying potential issues in crop development. Several CV techniques are developed to detect straight crop rows in an image, including the conventional Hough transform, linear regression, and green pixel accumulation [18,19,20]. The Hough transform is a popular technique that transforms the image into Hough space, detecting lines and geometric shapes in the original image. Linear regression fits a straight line to a set of points in the image, while green pixel accumulation counts the number of green pixels in a region of the image to infer the presence of straight crop lines. However, these techniques can misclassify weeds as crops, leading to false row detection [21]. They rely on certain assumptions about the characteristics of crops, such as their shape, size, and color, and are sensitive to variations in lighting and other environmental conditions. To overcome these limitations, fully convolutional networks (FCNs)-based semantic segmentation architectures have been introduced for row detection and crop coverage estimation tasks [22, 23]. These models can identify the pixels in an image corresponding to crops and weeds and assign each pixel to the appropriate class [23, 24].

Silva et al. introduce a method for crop row extraction using a UNet-based CNN model that performs semantic segmentation prior to post-processing steps [25]. They report an accuracy of 89.36%, which drops by 4.93% under difficult field conditions, such as high weed pressure and sparse crop content in misaligned crop rows. On the other hand, Tang et al. develop a sequence of image processing procedures to identify and estimate the distance between corn plants in crop row images [26]. Their algorithm utilizes various crop information, such as plant color, morphological features, and the center line of the crop rows. They report a spacing RMSE of 1.7 cm and a coefficient of determination (\(R^2\)) of 0.96. This method provides better results in estimating the spacing between crops during the initial growth stages, but the error increases in later stages and when the crops are damaged. To date, no studies have been found that employ FCN architectures to evaluate consistency based on plant spacing estimation, crop coverage area, and density per row.

We propose a Small Pyramid-UNet (SP-UNet) network that uses an FCN to segment crop leaves and rows. Like UNet [27], our network has an encoder–decoder structure, with ResNet34 [28] as the backbone. In the decoder section, we have incorporated the feature pyramid network (FPN) architecture to extract rich, multi-scale features. By adding more context and detail, this feature extraction helps to improve the model’s performance. Additionally, our architecture aims to reduce the number of parameters, which helps to prevent overfitting and results in a more lightweight network. Despite having only 21.85M trainable parameters, our model has prediction performance comparable to more complex segmentation models like HRNet_Mscale (72.1M) and SegNet (34.65M). We observe that HRNet_Mscale and SegNet exhibit poor performance in row extraction, especially in scenarios where plants within a row are sparsely distributed, which is frequently seen in corn crops. We utilize the Zhang–Suen thinning algorithm [29] to reduce each segmented row to a single-pixel-wide structure and integrate it with the progressive probabilistic Hough transform (PPHT) [30], which assigns a line to the center of each row.

After the crop segmentation task, we estimate crop coverage through leaf area index (LAI) calculation. This task identifies extracted pixels in the image corresponding to leaves and measures the total leaf area. While many studies support semantic segmentation models for estimating LAI, they do not establish a correlation with other key elements that contribute to the analysis of crop consistency [31]. For example, in [32], the authors use the UNet architecture to estimate LAI but do not present other consistency parameters such as crop row detection, plant stand count per row, and inter-row and inter-plant distances.

We use the latest object detector, YOLOv7 [33], to accurately determine corn stand count. To avoid redundant results, we apply the Distance-IOU Non-Maximum Suppression (DIoU-NMS) technique [34]. This method considers both the detection boxes’ overlapping areas and center points, enabling better detection of target plants that may be obstructed. Our algorithm evaluates the number of plants per row, their spacing, and LAI to quantify consistency, as shown in Fig. 1. Our framework is effective for crops under biotic stress and at progressive growth stages. We believe that no other existing framework in the literature can accurately assess multiple crop parameters to quantify crop consistency.

Fig. 1

Workflow of the proposed algorithm

This research paper presents three main contributions as follows:

  • A novel architecture called SP-UNet is proposed for segmenting crops and crop rows in the RGB images captured by field machinery-mounted cameras.

  • Implementation of a novel method for quantifying corn crop consistency under natural field conditions.

  • Establishing a correlation between inter-plant distance, plant stand count per row, and crop coverage percentage for crop consistency analysis.

The notations utilized in this work are summarized in Table 1.

Table 1 Summary of notations

2 Materials and methods

In this section, we present detailed information on the dataset and the QuanCro framework, which is utilized to evaluate the consistency of corn crops.

2.1 Data collection and its challenges

A dataset of 8000 images of corn fields at various growth stages is utilized in this study. These images are taken using SWATCAM, a system based on an Android phone, mounted on field machinery as depicted in Fig. 2. The images measure \(1440 \times 1080\) pixels and are taken at intervals of 15.24–21.33 m, with 17–20 images per acre. The dataset presents a range of difficulties, including low-light or nighttime shots, motion blur, tilted angles, and varying boom heights. Additionally, there are instances of mid-stage corn plants and weeds overlapping, which can create further complications. It is crucial to carefully select the data used to train the model to ensure accurate detection of corn plants in diverse conditions. Some examples from the dataset are depicted in Fig. 3. The images in Fig. 3a show early-stage crops surrounded by weeds. Figure 3b exhibits shots of crops under different lighting conditions, with the crops overshadowed by field machinery in the left image; in contrast, the right image is captured under an auxiliary lighting source. Figure 3c displays images with mid-stage crops in an overlapping state.

Fig. 2

SWATCAM cameras mounted on each sprayer boom of the field machinery (https://swatmaps.com)

Fig. 3

Illustration of various types of challenges in the dataset

2.1.1 Data augmentation

Data augmentation enlarges the dataset by generating new samples from the existing data. This is achieved by applying reasonable modifications to the original images, which are then added to the training dataset as new augmented copies. The semantic segmentation dataset is modified through various geometric and photometric augmentations, such as adjusting the saturation and hue of the image’s color channels and applying random scaling, translation, rotation, and cropping. In addition, for plant stand detection, the dataset is augmented by applying CutOut [35], CutMix [36], MixUp [37], and Mosaic [15] augmentation techniques. These techniques apply modifications to multiple images at once. In the MixUp technique, a pair of images are superimposed onto each other through a weighting operation to create a new sample, for which a new corresponding label is also generated. The CutOut method slices away a part of the image that the model may rely on heavily during training, forcing the model to learn all the elements in the image thoroughly. CutMix substitutes a slice of another image into the zero-pixel patch generated by CutOut. CutOut and CutMix are both predecessors of the Mosaic augmentation method, which combines images of different sizes, along with their annotations, into mosaic patterns. Subsequently, a random portion of this grid is selected and cropped to create a new augmented image, which is then utilized for training the network.
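As an illustration, the following sketch applies the photometric and geometric augmentations described above to an image–mask pair using the Albumentations library; the parameter values and the synthetic inputs are assumptions, not the exact settings used in this work.

```python
# Sketch of the photometric and geometric augmentations applied to an image-mask
# pair with the Albumentations library. Parameter values and the synthetic inputs
# below are illustrative assumptions, not the exact settings used in this work.
import albumentations as A
import numpy as np

augment = A.Compose([
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=20,
                         val_shift_limit=10, p=0.5),      # colour-channel jitter
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2,
                       rotate_limit=15, p=0.5),            # random translation/scale/rotation
    A.RandomCrop(height=768, width=768, p=1.0),            # random crop to network input size
])

image = np.random.randint(0, 256, (1080, 1440, 3), dtype=np.uint8)  # stand-in field image
mask = np.random.randint(0, 3, (1080, 1440), dtype=np.uint8)        # stand-in label mask
augmented = augment(image=image, mask=mask)                # mask receives the same geometry
aug_image, aug_mask = augmented["image"], augmented["mask"]
print(aug_image.shape, aug_mask.shape)
```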

2.1.2 Data annotation

This experiment uses two different labeling tools to prepare the dataset. Bounding boxes for each corn plant are generated using the LabelImg labeling tool, and the rows and plants are segmented using the Segments.ai platform (https://segments.ai). An example image for labeling corn plants using bounding boxes is shown in Fig. 4a. Only pixels belonging to the plants are annotated for semantic segmentation, while the crop rows are segmented using a polygon tool, as illustrated in Fig. 4b. The row annotation takes into account both the plants and the sparse spacing between them.

Fig. 4

Demonstration of different labeling techniques

2.2 Methodology

In this research paper, a novel approach called QuanCro is presented, which leverages a semantic segmentation model to identify crop rows and estimate the percentage of crop leaves. Additionally, an object detection algorithm is employed to locate and count individual plant stands. The segmentation model is further integrated with the Zhang–Suen thinning algorithm and the PPHT to identify crop rows accurately. Using these parameters, the plant density per row and inter-plant spacing are estimated, and further calculations are performed to quantify the consistency of crops in a corn field.

2.2.1 Plant stand count using YOLOv7

The YOLOv7 is the latest addition to the YOLO series. This network is a single-stage detector that predicts all bounding boxes and class probabilities with just one pass through a CNN without requiring pre-generated region proposals. The YOLOv7 architecture consists of three main modules: a backbone, a neck, and a head. The backbone uses extended efficient layer aggregation networks (E-ELANs) to extract key features from input images. The E-ELAN uses shuffle and merge cardinality in the final layer aggregation to improve the feature learning capabilities of the network across layers, resulting in increased accuracy. The neck module collects feature maps from different stages and utilizes FPN for enhancement. The head module makes the final detection, and a modified post-processing non-maximum suppression (NMS) suppresses multiple detection boxes of the same plant, resulting in a reduced number of false positives.

To improve the inference speed, the architecture incorporates a new model re-parameterization technique that combines multiple computational elements into a single module. This technique is separated into two groups: model-level and module-level ensemble. YOLOv7 uses a concatenation-based method for scaling the model, generating models of varying sizes, and ensuring the structure remains optimal during the execution stage. YOLOv7 is trained on our corn crop dataset for plant stand counting. The trained model is used to locate corn plants by predicting bounding boxes, which are utilized to count the number of plants in the images.

The classical NMS technique is the last post-processing step in detection algorithms [38]. It uses a score-based criterion to keep a single detection box out of many overlapping boxes, since the overlapping boxes are false-positive predictions that need to be suppressed. An alternative is Soft-NMS, which applies a continuous function based on IoU to reduce the detection scores of neighboring bounding boxes, resulting in a gentler and more resilient suppression than the classical NMS approach [39]. However, both of these techniques have shortcomings in occlusion scenarios: they struggle to detect objects of the same class in close proximity and tend to delete true predictions. This problem is addressed by adopting the DIoU-NMS method [34], which performs well in crowded scenes and effectively handles highly overlapping target objects. This method suppresses boxes by considering both the distance between the center points of the corresponding detection boxes and their overlapping area. The distance-IoU penalty term, denoted as \(\mathcal {R}_{{\text {DIoU}}}\), equipped with an adaptive threshold facilitates score penalization of false neighboring boxes to achieve a refined detection box.

$$\begin{aligned} s_i=\left\{ \begin{array}{ll} s_i, &{} \quad {\text {IoU}}-\mathcal {R}_{{\text {DIoU}}}\left( \mathcal {D}, P_i\right) <\varepsilon \\ 0, &{} \quad {\text {IoU}}-\mathcal {R}_{{\text {DIoU}}}\left( \mathcal {D}, P_i\right) \ge \varepsilon \end{array}\right. \end{aligned}$$
(1)

where \(\mathcal {D}\) represents the detection box with the highest score, while \(s_i\) and \(\varepsilon \) denote the classification score and the NMS threshold, respectively. Based on the suppression criterion and the predefined threshold, a neighboring box \(P_i\) with a lower score is eliminated.
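To make the suppression rule in (1) concrete, the following NumPy sketch implements DIoU-NMS as described above; the function name, box format (x1, y1, x2, y2), and threshold value are illustrative assumptions rather than the exact implementation used in the framework.

```python
# Sketch of DIoU-NMS following Eq. (1): a neighbouring box P_i is suppressed only
# when IoU(D, P_i) - R_DIoU(D, P_i) exceeds the threshold epsilon.
import numpy as np

def diou_nms(boxes, scores, eps=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]                      # current highest-scoring box D
        keep.append(i)
        rest = order[1:]
        # plain IoU between D and the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # DIoU penalty: squared centre distance over squared enclosing-box diagonal
        cxi, cyi = (boxes[i, 0] + boxes[i, 2]) / 2, (boxes[i, 1] + boxes[i, 3]) / 2
        cxr, cyr = (boxes[rest, 0] + boxes[rest, 2]) / 2, (boxes[rest, 1] + boxes[rest, 3]) / 2
        ex1 = np.minimum(boxes[i, 0], boxes[rest, 0])
        ey1 = np.minimum(boxes[i, 1], boxes[rest, 1])
        ex2 = np.maximum(boxes[i, 2], boxes[rest, 2])
        ey2 = np.maximum(boxes[i, 3], boxes[rest, 3])
        r_diou = ((cxi - cxr) ** 2 + (cyi - cyr) ** 2) / \
                 ((ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9)
        # keep only neighbours whose IoU - R_DIoU stays below the threshold
        order = rest[(iou - r_diou) < eps]
    return keep
```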

2.2.2 Proposed semantic segmentation network: SP-UNet

The proposed SP-UNet follows a U-shaped encoder–decoder structure, similar to the original UNet [40]. The encoder contains convolutional and max pooling layers, while the decoder has upsampling and convolutional layers. The U-shape of the network facilitates feature localization in the input image and preserves spatial resolution in the segmentation output. The encoder produces feature maps of various scales by convolving and down-sampling the input, which is then concatenated with the corresponding upsampling layers of the decoder using skip connections. The skip connections enable the decoder to access and combine high-level global and low-level local features, which is crucial for accurate image segmentation, particularly for enhancing edge features.

To increase the effectiveness of feature extraction during the down-sampling process, the ResNet34 [28] module is added to the UNet encoder. The ResNet34 architecture comprises multiple residual blocks, each containing several convolutional and batch normalization layers, and has shortcut connections between each block’s input and output. The residual connections allow the gradients to flow more easily through the network, making it possible to train much deeper networks without the vanishing gradient problem. Integrating the ResNet34 architecture within the UNet encoder enhances the feature extraction capability of the model by utilizing the shortcut connections present in the ResNet34. This is particularly beneficial for image segmentation tasks as it allows for extracting high-level semantic information and fine-grained details from the input image, which is crucial for precise object recognition and classification [41]. To balance the number of trainable parameters and the model’s ability to extract features, only the first five blocks of the ResNet34 are used.
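As an illustration of how such an encoder can be assembled, the sketch below exposes the first five stages of a torchvision ResNet34 as multi-scale feature outputs for the skip connections; the stage grouping and the choice of random (rather than ImageNet-pretrained) weights are assumptions, not the exact configuration used here.

```python
# Sketch of a ResNet34-based encoder exposing five feature scales for the skip
# connections. Stage grouping and weight initialization are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class ResNet34Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        r = models.resnet34(weights=None)                    # torchvision ResNet34
        self.stage1 = nn.Sequential(r.conv1, r.bn1, r.relu)  # 1/2 resolution, 64 ch
        self.stage2 = nn.Sequential(r.maxpool, r.layer1)     # 1/4 resolution, 64 ch
        self.stage3 = r.layer2                               # 1/8 resolution, 128 ch
        self.stage4 = r.layer3                               # 1/16 resolution, 256 ch
        self.stage5 = r.layer4                               # 1/32 resolution, 512 ch

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        return f1, f2, f3, f4, f5   # multi-scale features for the decoder skip connections

feats = ResNet34Encoder()(torch.randn(1, 3, 768, 768))
print([tuple(f.shape) for f in feats])
```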

The decoder is made up of five upsampling blocks. Each block contains transpose convolutions followed by two convolutions, batch normalization, and ReLU activation. This increases the resolution of feature maps while reducing their depth. Using FPN improves the network’s performance by enabling it to incorporate multi-scale contextual information from different network levels. This is particularly useful in tasks like image segmentation, where the spatial layout of objects is important. For efficient crop row segmentation, the network needs to detect large row segments rather than specific vegetation regions in images. Utilizing an FPN in the decoder part of the network allows the network to consider global context and local details in the image. This leads to increased model performance with fewer parameters and reduced computation time (Fig. 5).

Fig. 5

Building block of residual unit

The output features of the first four decoder blocks are up-sampled to a fixed size of \(384 \times 384\) and a depth of 64 channels, creating four feature-map pyramids: \(P_1, P_2, P_3\), and \(P_4\). These 64-channel pyramids are concatenated, resulting in a feature map with 256 channels. This map is then concatenated with the output of the fourth upsampling layer and fed to the fifth and last upsampling layer, which increases its resolution to match the input size before it is processed by the final block of the network. The final block, consisting of three \(3 \times 3\) convolutions, is applied to the output of the fifth upsampling layer, resulting in a \(768 \times 768\) feature map with 32 channels. The final step applies a \(1 \times 1\) convolution filter to the feature map, reducing the number of channels to match the number of classes (crop leaves and rows). This final feature map is then used to predict the class for each pixel in the image. A graphical representation of the model’s structure is shown in Fig. 6.
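The pyramid fusion can be sketched as follows; the 64-channel, \(384 \times 384\) pyramids and the 256-channel concatenation follow the text, while the per-block input channel counts, the \(1 \times 1\) projection convolutions, and the dummy feature-map sizes are assumptions made for illustration.

```python
# Sketch of the pyramid fusion: four decoder outputs are projected to 64 channels,
# resized to 384 x 384, and concatenated into a 256-channel map. Input channel counts
# and dummy sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    def __init__(self, in_channels=(512, 256, 128, 64)):
        super().__init__()
        # 1x1 projections bring each decoder output to 64 channels (assumption)
        self.proj = nn.ModuleList([nn.Conv2d(c, 64, kernel_size=1) for c in in_channels])

    def forward(self, feats, size=(384, 384)):
        pyramids = [
            F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        ]                                   # P1..P4, each 64 x 384 x 384
        return torch.cat(pyramids, dim=1)   # 256 x 384 x 384

# The fused map is then concatenated with the fourth upsampling output and passed
# through the fifth upsampling block and the final 3x3 / 1x1 convolutions.
fusion = PyramidFusion()
dummy = [torch.randn(1, c, s, s) for c, s in zip((512, 256, 128, 64), (48, 96, 192, 384))]
print(fusion(dummy).shape)   # torch.Size([1, 256, 384, 384])
```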

Fig. 6

SP-UNet network architecture. The number of channels is represented by the vertical numbers on the left side, while the numbers below each map represent the size of the feature map

2.2.3 Crop row detection algorithm

The SP-UNet is used to segment the corn crop rows from the background, which includes the soil and weeds. The Zhang–Suen thinning method [42] is then applied to the segmented rows to obtain a minimal, one-pixel-wide representation (skeleton) of the crop rows [29]. This thinning method is an iterative process consisting of two passes. During each pass, it examines whether each pixel satisfies the following conditions before eliminating the outer pixels of the region that do not belong to the skeleton.

  • First pass

    1. The chosen pixel f is black and is surrounded by eight neighboring pixels.

    2. It has a minimum of two and a maximum of six black neighboring pixels.

    3. The connectivity value of this pixel is one.

    4. At least one pixel should be white in either the east, north, or south position from pixel f.

    5. At least one pixel should be white in either the west, east, or south position from pixel f.

  • Second pass

    1. The chosen pixel f is black and is surrounded by eight neighboring pixels.

    2. It has a minimum of two and a maximum of six black neighboring pixels.

    3. The connectivity value of this pixel is one.

    4. At least one pixel should be white in either the east, north, or west position from pixel f.

    5. At least one pixel should be white in either the west, north, or south position from pixel f.

This process uses a \(3 \times 3\) window and scans the entire image in a clockwise direction. In each pass, the pixels that satisfy all the conditions are eliminated, and the procedure continues until no more pixels are removed and a definitive skeletal structure is obtained. There are alternative thinning algorithms, such as Guo–Hall [43], which deletes redundant pixels and produces thinner skeletons than the Zhang–Suen algorithm but at times does not retain the structure of the component accurately [44].

Following the thinning step, a Gaussian filter with a small standard deviation (SD) is employed to minimize noise. A CV technique called PPHT is adopted at the final stage to obtain a precise center line for each crop row. This adaptation of the probabilistic Hough transform (PHT) [45] performs the standard Hough transform [46] on only a fraction of the points and exploits the fact that different lines require different numbers of supporting points (votes) to be detected accurately. This significantly reduces the computation required by PPHT in contrast to PHT, which needs prior knowledge of the fraction of points to be used for voting and is therefore time-consuming.
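A minimal OpenCV sketch of this stage is given below, assuming a binary row mask from SP-UNet, the opencv-contrib package for the cv2.ximgproc thinning routine, and illustrative Hough parameters; cv2.HoughLinesP implements the progressive probabilistic Hough transform.

```python
# Sketch of the row centre-line extraction stage: Zhang-Suen thinning of a binary row
# mask (cv2.ximgproc requires opencv-contrib-python), light Gaussian smoothing, and the
# progressive probabilistic Hough transform via cv2.HoughLinesP. The synthetic mask and
# all thresholds are illustrative assumptions.
import cv2
import numpy as np

row_mask = np.zeros((1080, 1440), np.uint8)                    # stand-in SP-UNet row output
cv2.line(row_mask, (320, 1079), (520, 0), 255, thickness=60)   # two synthetic row blobs
cv2.line(row_mask, (920, 1079), (820, 0), 255, thickness=60)

skeleton = cv2.ximgproc.thinning(row_mask,
                                 thinningType=cv2.ximgproc.THINNING_ZHANGSUEN)
smoothed = cv2.GaussianBlur(skeleton, (3, 3), 0.8)             # small-SD noise reduction

lines = cv2.HoughLinesP(smoothed, rho=1, theta=np.pi / 180, threshold=60,
                        minLineLength=100, maxLineGap=40)      # PPHT centre-line candidates
for x1, y1, x2, y2 in (lines.reshape(-1, 4) if lines is not None else []):
    print("row centre line:", (x1, y1), "->", (x2, y2))
```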

Fig. 7

Illustration of the detected and extended line

Because of the sparse inter-plant spacing at the edges of the images, an incomplete row length is obtained, leading to incorrect plants-per-unit-length calculations (plants/length of row). To address this problem, the detected lines are extended to the image edges, as shown in Fig. 7. In addition to line detection, contour lines are drawn around each row’s center line. This restricts plant detection and the numerical calculations to within the contour-line margins and avoids detecting and counting volunteer plants that grow outside the row. The row detection process is depicted in Fig. 8.
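The line-extension step can be sketched as follows; the helper function and the assumption that rows run roughly top-to-bottom in the image are illustrative, not the exact implementation.

```python
# Sketch of extending a detected centre-line segment to the image borders so that the
# per-row length (and plants per unit length) is not underestimated. Hypothetical helper,
# assuming rows run roughly top-to-bottom in the frame.
import numpy as np

def extend_to_borders(x1, y1, x2, y2, width, height):
    """Extend the segment (x1, y1)-(x2, y2) until it meets the top and bottom edges."""
    if y1 == y2:                           # horizontal segment: stretch to left/right edges
        return 0, y1, width - 1, y2
    t_top = (0 - y1) / (y2 - y1)           # parameter where the line crosses y = 0
    t_bot = (height - 1 - y1) / (y2 - y1)  # parameter where the line crosses y = height-1
    x_top = float(np.clip(x1 + t_top * (x2 - x1), 0, width - 1))
    x_bot = float(np.clip(x1 + t_bot * (x2 - x1), 0, width - 1))
    return x_top, 0, x_bot, height - 1

print(extend_to_borders(300, 200, 320, 600, width=1440, height=1080))
```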

Fig. 8

Proposed crop row extraction workflow

2.2.4 Inverse perspective transformation

The images are fundamentally two-dimensional (2D) projections of three-dimensional (3D) objects. This perspective projection leads to distortion and fails to capture information on an object’s precise spacing and location from an RGB image [47]. The projection of parallel lines seems to converge and intersect at a point (vanishing point) when represented in a linear perspective. Some of the images in our dataset have crop rows converging at the vanishing point, which contrasts with the field plane, where the rows are parallel. This makes the field-scale calculation for estimating the distance between rows incorrect. The number of pixels at the image’s upper edge (near the vanishing point) corresponding to a certain physical distance differs from those at the lower edge. To eliminate these non-linear perspective effects, it is important to apply inverse perspective mapping. This technique takes every pixel from a 2D projection of 3D objects and remaps it to a new position, subsequently generating a new image representation on an inverse 2D plane [47]. The equations to support this mapping are formulated as follows:

$$\begin{aligned} F= \{(x, y, z)\} \in E^3 \end{aligned}$$
(2)
$$\begin{aligned} I= \{(u, v)\} \in E^2 \end{aligned}$$
(3)
$$\begin{aligned} u(x, 0, z)= \frac{\rho (x, 0, z)-(\theta -\mu )}{\frac{2 \mu }{k-1}} \end{aligned}$$
(4)
$$\begin{aligned} v(x, 0, z)= \frac{\vartheta (x, 0, z)-(\varphi -\mu )}{\frac{2 \mu }{g-1}} \end{aligned}$$
(5)
$$\begin{aligned} \rho = \tan ^{-1}\left( \frac{z}{x}\right) \end{aligned}$$
(6)
$$\begin{aligned} \vartheta = \tan ^{-1}\left( \frac{h}{\sqrt{x^2+z^2}}\right) \end{aligned}$$
(7)

The coordinates of the field plane are translated onto the image plane using (4) and (5), where \(\theta \) indicates the angle between the optical projection axis and the horizontal plane, and \(\varphi \) denotes the angle between the horizon and the optical axis. The \(\mu \) and \(k \times g\) represent the camera’s angular aperture and the image resolution, respectively. Lastly, the angles \(\rho \) and \(\vartheta \) can be viewed as projection weighting factors for the pixels that change horizontally and vertically during the remapping process. In Fig. 9, the left image shows the original field image, where the crop rows are non-parallel and their vertices form a trapezoid. The right image is the bird’s-eye view resulting from the inverse perspective transformation.
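The remapping is derived analytically through (2)–(7) above; as an illustrative alternative, under the assumption that four corresponding corner points of the row trapezoid are known, the same bird's-eye effect can be sketched with an OpenCV homography.

```python
# Illustrative bird's-eye-view remapping. The paper derives the remapping analytically
# from camera geometry (Eqs. 2-7); this sketch instead estimates a homography from four
# assumed corner points of the row trapezoid in a 1440 x 1080 frame.
import cv2
import numpy as np

src = np.float32([[420, 400], [1020, 400], [1380, 1079], [60, 1079]])  # row trapezoid (assumed)
dst = np.float32([[300, 0], [1140, 0], [1140, 1079], [300, 1079]])     # parallel-row target

H = cv2.getPerspectiveTransform(src, dst)
image = np.full((1080, 1440, 3), 128, np.uint8)           # stand-in perspective image
birds_eye = cv2.warpPerspective(image, H, (1440, 1080))   # rows become parallel
print(birds_eye.shape)
```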

Fig. 9

Demonstration of perspective view image and Bird’s-eye view image

2.2.5 Plant spacing calculations

Once the plant detection model detects individual plant stands and determines the plant population across each row, the spacing between adjacent plant stands is calculated. This numerical estimation is based on the distance between the plants’ centroids (x, y). The Euclidean distance is the most appropriate metric to determine the spacing between neighboring plants in the rows [48]. Before this calculation, a pairwise distance matrix is constructed to avoid estimating multiple distances from each plant to the rest of the row, and a triangular form of this matrix discards repeated values so that a distance is only computed between each plant and its closest neighboring plant. The equation for the distance (in pixels) estimation is given by (8):

$$\begin{aligned} \text{ Distance } (F, G)^2=\left( F_x-G_x\right) ^2+\left( F_y-G_y\right) ^2 \end{aligned}$$
(8)

The subscripts x and y denote the centroid coordinates of the bounding boxes corresponding to plants F and G. The distance estimated in (8) is then translated into field-scale (actual) values using the inter-row distance obtained from (9):

$$\begin{aligned} D= \frac{|c_2 - c_1|}{\sqrt{1+m^2}} \end{aligned}$$
(9)
$$\begin{aligned} Y_1= m x+c_1 \end{aligned}$$
(10)
$$\begin{aligned} Y_2= m x+c_2 \end{aligned}$$
(11)

The line equations associated with the two adjacent crop rows are (10) and (11), and D represents the shortest distance in pixels between them. As the crop rows are generally 0.3048 m apart in the corn field, (12) calculates the physical distance corresponding to each pixel. This scale is then multiplied by each inter-plant distance, Distance (F, G) in (8), to obtain the actual spacing in field units.

$$\begin{aligned} scale = 0.3048/D \end{aligned}$$
(12)
Fig. 10

Distance estimation between each crop in their respective row shown for both mid- and early-stage crops images

The calculation based on precise distance estimation between each plant provides a relative analysis of crop consistency. The projected distance estimation on both mid- and early-stage crop images is demonstrated in Fig. 10, where blue font denotes the distance in pixels. Following the distance calculation, the coefficient of variation (CoV) is obtained, providing further insight into the consistency of the crops.
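A minimal sketch of the spacing and CoV computation for a single row is given below; the centroid coordinates, the assumption that plants can be ordered along the row axis, and the pixel inter-row distance D are illustrative stand-ins for the distance-matrix procedure described above.

```python
# Sketch of the inter-plant spacing and coefficient-of-variation computation for one
# row: nearest-neighbour gaps between detected-box centroids along the row, converted
# to metres with the pixel scale of Eq. (12). Centroids and D are illustrative values.
import numpy as np

centroids = np.array([[120, 840], [260, 830], [410, 845], [545, 838]], dtype=float)
sorted_c = centroids[np.argsort(centroids[:, 0])]        # order plants along the row axis

pixel_gaps = np.linalg.norm(np.diff(sorted_c, axis=0), axis=1)   # Eq. (8), per neighbour pair
D = 95.0                                                 # inter-row distance in pixels (assumed)
scale = 0.3048 / D                                       # metres per pixel, Eq. (12)
spacings_m = pixel_gaps * scale

mean_spacing = spacings_m.mean()
cov = 100.0 * spacings_m.std() / mean_spacing            # coefficient of variation (%)
print(f"mean spacing: {mean_spacing:.3f} m, CoV: {cov:.1f}%")
```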

2.2.6 LAI estimation from crop segmentation

LAI characterizes the structural attributes of a crop quantitatively to analyze crop emergence at a particular stage. The SP-UNet model is trained to segment images into different classes, such as crops and rows. The segmented crop content is used to calculate the total leaf area over a large field. The percentage of crop LAI is calculated as follows:

$$\begin{aligned} \text{ Percentage } \text{ of } \text{ crop } \text{ LAI }=\frac{\sum M_{\text{ crop } }}{\sum N_{\text{ all } }} \end{aligned}$$
(13)

\(\sum M_{{\text {crop}}}\) represents the sum of all pixels classified as crop content, while \(\sum N_{{\text {all}}}\) indicates the sum of all pixels in the image, including crops, background, and weeds.
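A minimal sketch of (13), assuming the SP-UNet output is an integer class map in which class 1 denotes crop leaves:

```python
# Sketch of the crop-coverage (LAI percentage) computation in Eq. (13) from a predicted
# per-pixel class map; the class id and the dummy prediction are illustrative assumptions.
import numpy as np

def crop_coverage_percentage(class_map, crop_class=1):
    """Ratio of crop pixels to all pixels in the image, expressed as a percentage."""
    crop_pixels = np.count_nonzero(class_map == crop_class)   # sum of M_crop
    all_pixels = class_map.size                               # sum of N_all
    return 100.0 * crop_pixels / all_pixels

pred = np.random.randint(0, 3, size=(768, 768))               # dummy prediction map
print(f"crop LAI: {crop_coverage_percentage(pred):.1f}%")
```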

The CoV value, along with the LAI estimate, is used as the deciding factor for assessing the consistency of the crops at a particular growth stage. These details are elaborately discussed in the next section.

3 Results

We conduct experiments to test the effectiveness of our proposed QuanCro method. We evaluate the accuracy of our plant counting model by comparing it to manual counts on a separate test dataset. We also compare the performance of different detection networks, including YOLOv3, YOLOv5, and YOLOv7, based on their mAP and F1 scores. We visually assess the detection results to determine which model performs best. Additionally, we analyze several segmentation architectures, such as SegNet, UNet, HRNet_Mscale, and DeepLabV3+, with different backbones like ResNet and VGG, and compare them to our proposed SP-UNet model. Finally, we calculate numerical estimations to determine the consistency of corn crops and establish their correlation with the LAI from the SP-UNet.

3.1 Evaluation metrics

Several metrics can be used when evaluating a segmentation model’s performance. One of the most commonly used and effective metrics is the intersection over union (IoU), also known as the Jaccard index. This metric is calculated as the ratio of the intersection of the predicted segmentation mask (Pr) and the ground-truth segmentation mask (Gr) to their union. It ranges from 0 to 1, with a value of 1 indicating a perfect segmentation, given as:

$$\begin{aligned} {\text {IoU}}=\frac{|{\text {Pr}} \cap Gr|}{|{\text {Pr}} \cup Gr|} \end{aligned}$$
(14)

Mean IoU (mIoU) is the average of the IoU scores across all classes and provides an overall measure of the model’s performance. The formula for mIoU is as follows:

$$\begin{aligned} {\text {mIoU}} = \frac{1}{c} \sum _{i=1}^{c} {\text {IoU}}_i \end{aligned}$$
(15)

where c is the number of classes and \({\text {IoU}}_i\) is the IoU score of the ith class, calculated using the formula in (14). Another popular metric is the Dice coefficient (DC), which is calculated as twice the intersection of the predicted and ground-truth segmentation masks divided by the sum of the sizes of the two masks. It also ranges from 0 to 1, with a value of 1 indicating perfect segmentation.

$$\begin{aligned} {\text {DC}} = \frac{2 \times |{\text {Pr}} \cap Gr|}{|{\text {Pr}}| + |Gr|} \end{aligned}$$
(16)

Furthermore, accuracy is used as an additional metric to evaluate and compare our model with the baseline/reference models.
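The following sketch computes these segmentation metrics from integer label maps, following (14)–(16); the class count and the dummy prediction/ground-truth arrays are illustrative.

```python
# Sketch of the segmentation metrics in Eqs. (14)-(16) computed from integer label maps;
# the class count and dummy arrays below are illustrative.
import numpy as np

def iou_per_class(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else np.nan)            # Eq. (14)
    return ious

def dice_coefficient(pred, gt, cls):
    p, g = pred == cls, gt == cls
    return 2 * np.logical_and(p, g).sum() / (p.sum() + g.sum())    # Eq. (16)

pred = np.random.randint(0, 3, size=(768, 768))
gt = np.random.randint(0, 3, size=(768, 768))
ious = iou_per_class(pred, gt, num_classes=3)
print("per-class IoU:", ious, "mIoU:", np.nanmean(ious))           # Eq. (15)
print("Dice (crop class):", dice_coefficient(pred, gt, cls=1))
```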

Mean average precision (mAP) is commonly used in object detection and measures the average precision (AP) across different object classes, providing an overall measure of the model’s performance. To determine the mAP, the AP for each class is computed and then the APs are averaged across all classes. The AP score for a specific class evaluates the model’s P and recall (R) when detecting instances of that class. It is determined by creating a P–R curve and taking the average of the P values at various R levels.

$$\begin{aligned} {\text {AP}} = \int _0^1 P(R)\, {\text {d}}R \end{aligned}$$
(17)
$$\begin{aligned} {\text {mAP}} = \frac{1}{c_L} \sum _{i=1}^{c_L} {\text {AP}}_i \end{aligned}$$
(18)

By employing these evaluation criteria, we ensure a comprehensive and rigorous assessment of our frameworks’ performance in semantic segmentation and object detection tasks. These metrics allow for fair comparisons with other state-of-the-art networks.

3.2 Evaluation of plant detection model

In this study, we evaluate the performance of different object detection models in the YOLO pipeline by comparing their results on our dataset of corn crops. To ensure a fair analysis, we use consistent hyperparameter settings, including a learning rate of \(1 \times {10}^{-2}\), a batch size of 16, and 120 training epochs for all models. During training, we utilize the Adam optimizer and binary cross-entropy loss function for all models. According to Table 2, YOLOv7 outperforms all other models with the highest precision and mAP values, which are 5.06% and 1.87% higher than YOLOv5. While YOLOv3 has the highest recall value of 0.902, its mAP value is lower than YOLOv5 and YOLOv7. Among all the models evaluated in this study, YOLOX has the worst performance.

Table 2 Comparing the accuracy and efficiency of various YOLO object detection architectures

Table 3 compares networks used to identify and count corn stands in both RGB and UAV imagery. While semi-supervised and supervised DL methods have been successful in detecting and counting early-stage corn crops, their effectiveness on mid-stage crops has not been reported yet [2, 13]. The table highlights that the Faster R-CNN method integrated with VGG19 achieves an impressive 97.71% precision rate for detecting corn crops. However, its performance decreases as the crops grow and in varying lighting conditions [49]. On the other hand, our YOLOv7-based corn plant stand counting approach achieves high precision, outperforming other networks used for the same task.

Table 3 Comparing various methods for detecting and classifying corn crops: an in-depth performance analysis

In Fig. 11, we compare the actual manual crop count in test Fields A and B at different growth stages with the predicted count obtained from YOLOv7. We use two NMS post-processing techniques: NMS and DIoU-NMS. We evaluate 38 examples randomly selected from each field. Field A primarily consists of mid-stage crops, while Field B represents early-stage crops. The number of crop stands in Field A is higher than in Field B.

The results demonstrate that the correlation between the manual and predicted counts using DIoU-NMS is stronger than with NMS. We observe similar trends in Field B, but with slightly higher variations between the actual and predicted counts using both techniques. This is due to the higher crop emergence at later growth stages with overlapping crops and the resemblance of weeds to the narrow structure of the leaves. Despite this, the correlation between the actual and predicted counts using DIoU-NMS post-processing is superior. It achieves an RMSE of 1.400, which is 2.3% better than the RMSE achieved by the model when utilizing the NMS technique. The \(R^2\) value using the DIoU-NMS is 0.985.

Fig. 11

Comparison of counts from NMS and DIoU-NMS with manual counts

3.3 Crop row extraction networks

Table 4 evaluates the effectiveness of the proposed SP-UNet for extracting crop rows on corn fields against existing methods. A modified Hough transform-based technique yields a 98% correct detection rate, but only on a limited dataset of 100 images, which is insufficient for determining overall performance in different scenarios [51]; furthermore, this technique is sensitive to noise and low contrast in the data. Another method, mask scoring R-CNN, has a high mAP of 82.8%, but its accuracy decreases with lower spatial resolution [52]. Although the Adaptive Multi-ROI method [53] has a high accuracy of around 95%, it is computationally more expensive than the proposed SP-UNet.

Table 4 Comparison of existing row detection models with our proposed SP-UNet on corn field

3.4 Evaluation of segmentation models and their comparison with SP-UNet

The performance of various models is compared in Table 5 using several evaluation metrics, including mIoU, validation accuracy, and the number of trainable parameters of each model. Among all the models, DeepLabV3+ with ResNet50 has the lowest number of trainable parameters and is computationally faster than the others. However, its mIoU score is 2.6% and 2.4% lower than those of HRNet_Mscale and SP-UNet, respectively. SegNet50, which is SegNet with a ResNet50 backbone, has the lowest IoU score of 0.739 for crop rows and requires 58.6% more parameters than SP-UNet. Although HRNet_Mscale has the best mIoU score, it is computationally complex and requires many parameters. Additionally, its high validation accuracy is not a reliable metric for evaluating segmentation models due to the imbalanced class data. In contrast, the SP-UNet model achieves better crop and row IoU scores than HRNet_Mscale while using roughly 3.3 times fewer parameters (21.85M vs. 72.1M). Its faster processing speed and minimal impact on IoU scores make it a more suitable choice for various applications. Furthermore, the SP-UNet obtains the second-best mIoU score, employing only 21.85M parameters.

Table 5 Comparison of segmentation models with our proposed model on corn field

The SP-UNet is trained using the Adam optimizer and a learning rate of \(1 \times 10^{-2}\). The accuracy and loss graphs for the training and validation sets are shown in Fig. 12. In Fig. 12a, it is evident that the accuracy fluctuates frequently throughout the training epochs. The accuracy for both sets reaches approximately 95% around the 40th epoch. After this point, the validation accuracy follows a steady trend, while the training accuracy shows sporadic fluctuations. As the training continues, the model’s performance on the validation dataset improves, reaching an accuracy of 96.5%. At the same time, the training accuracy reaches 98.6%. The loss graphs in Fig. 12b show that the model generalizes well on unseen data. This is supported by the rapid convergence of the training and validation loss, which eventually reaches values of 0.04 and 0.02, respectively, at the 72nd epoch.

Fig. 12

Accuracy and loss curves of the SP-UNet for the training and validation sets

Figure 13 compares various segmentation models for crop row extraction and for the crop segmentation used in LAI estimation. This analysis is conducted during the early and mid-stages of crop growth, and the results are demonstrated using two randomly selected images from each stage. The first column displays the test images, the second and third columns show the output of crop and row segmentation, respectively, and the final column shows the center-line association with the crop rows. Among the segmentation models tested, the SegNet50-based model performs the worst, with incorrect row segmentation leading to false positives during early-stage crop growth and misaligned rows during mid-stage growth. Additionally, it misses some crop pixels, leading to incorrect LAI calculations. The UNet architecture performs better in segmenting rows but misclassifies some background/weed pixels as crops, leading to incorrect row alignment. It also exhibits small misses in crop segmentation, especially in poor field conditions. Our proposed SP-UNet model effectively overcomes these challenges and accurately segments both rows and crops, which is beneficial for LAI estimation and for determining the distance between crops within the predicted rows.

Fig. 13

Demonstration of different CNN models for segmenting crops and rows in corn crop images, and the final center-line association to the rows

3.5 Quantification of crops’ consistency

Extensive analysis is carried out on corn cropland spanning a thousand acres to determine the correlation between crop emergence percentage, average inter-plant spacing, and variance. The field area has been presented in a heatmap format using the ArcGIS platform [54], as shown in Fig. 14. This online mapping application utilizes KMZ (Keyhole Markup Language Zipped) files to analyze geographical data. KMZ files are compressed folders containing all the necessary files and KML (Keyhole Markup Language) format files representing the location coordinates. These location coordinates are represented in XML notation. The left-most map demonstrates the average plant spacing obtained through distance estimation between consecutive plant stands. The middle map showcases the proposed SP-UNet model’s crop segmentation, while the right-most image displays the variance obtained from the plant spacing. The field heatmaps exhibit a significant degree of dispersion, indicating a considerable variation in crop spacing. A correlation between mean plant spacing and variance can be observed in the left and right maps, pinpointing key locations with high or low spacing. Further examination of these patterns can be conducted through crop segmentation techniques. Regions with high crop density display a lower variance, whereas areas with low crop density exhibit a higher variance. At an intermediate level of crop density, the dispersion appears to be moderate.

Fig. 14

Heatmaps of average plant spacing, variance, and crop segmentation overlaid on top of an RGB satellite image background

The LAI value provides information about the percentage of crops in an image, while the CoV is derived from the SD and mean spacing. These variables have a strong correlation that can give insight into the consistency of crops. The correlation between them is established numerically in Table 6, and a consistency determinant is assigned based on these quantifications. To assess consistency, eight images are randomly selected from the test data. The table shows that as the spacing variance increases with the CoV value, the LAI decreases due to missed crops, suggesting an inversely proportional relationship. When CoV is around 30% and LAI is around 80%, the crop is uniform across the area with minimal sparsity in the rows. When CoV is around 70% and LAI is around 25%, the crop is either in a stressed condition or the field has very low-density areas. The expression "moderate" indicates the acceptable range of variations in crop consistency. As CoV increases, the estimated crop LAI tends to decrease, as observed in Fig. 15. For example, Image 1 shows good consistency in crop spacing with an LAI of 0.802, whereas Image 8 exhibits poor consistency with an LAI lower by a factor of roughly 3.6 (a 261.3% relative difference). While the plant stand count is a valuable metric, it is insufficient to fully evaluate crop consistency. This becomes apparent when observing Images 6 and 8, where one section of a row contains multiple crops close together, while the crops in the next row are widely scattered. The CoV value is helpful in these circumstances. Utilizing the CoV together with LAI can provide a more comprehensive understanding of crop health, and a lower LAI value can indicate stress in the crops.

Table 6 Establishment of the correlation between the crop variables with a consistency determinant to evaluate crop consistency
Fig. 15

Representation of the consistency of crop images, showing examples of both well-consistent and poorly consistent crops

4 Conclusion

This research presents a novel vision-based framework, QuanCro, to assess crop consistency by estimating four crop parameters: plant stand count, inter-plant distance, coefficient of variation, and crop coverage in corn fields. Accurate estimation of these parameters is crucial for operational decisions, such as replanting and post-emergence herbicide applications, and for assessing spatial production variability. QuanCro utilizes RGB imagery captured using a top-mounted camera on field machinery. It uses the state-of-the-art object detector YOLOv7 to determine the number and location of plant stands. Additionally, the framework uses our proposed semantic segmentation network, the SP-UNet encoder–decoder architecture, for segmenting crops and crop rows. The SP-UNet has fewer parameters (21.85M) than other semantic segmentation models such as SegNet (34.65M), HRNet_Mscale (72.10M), and Simple UNet (31.04M). Despite having fewer parameters, the SP-UNet achieves a higher segmentation mIoU of 0.913, with IoU scores of 0.94 and 0.924 for crop and crop row segmentation, respectively. The segmented rows are further processed using Zhang–Suen thinning and the progressive probabilistic Hough transform to assign a line to each row’s center. The segmented crop provides the leaf area index estimation.

This framework is promising and holds significant potential for future research in assessing the consistency of various crop types and adapting it to diverse agricultural settings. One area of interest for upcoming studies is the integration of alternative sensor modalities, such as multispectral or hyperspectral imaging, to enhance QuanCro’s adaptability and robustness in different field conditions. It is also crucial to acknowledge a limitation of the proposed framework when dealing with late growth stages of crops. At this stage, individual plants tend to cluster closely together, posing a challenge for the framework’s object detector to detect and distinguish individual plants accurately. The overlapping nature of these clusters can lead to erroneous inter-plant spacing calculations, ultimately affecting the overall assessment of crop consistency.