Introduction

Pumpkin, a member of the Cucurbitaceae family, is a medically and economically valuable plant species [1]. According to FAO data, 24 million tons of pumpkin are produced annually worldwide. China ranks first in world pumpkin production, followed by Ukraine, Russia, the United States, Spain, and Turkey [2]. Pumpkin and squash varieties, including those grown for snack seeds, are cultivated for commercial purposes [3]. The flesh of the squash is used in soups, vegetable dishes, cakes, desserts, and confectionery [4]. While the seeds are consumed as snacks [5], the waste parts are used in animal nutrition [6]. In addition to being consumed fresh or roasted, pumpkin seeds are used as a food supplement, in salads, meals, and sauces, and, through the oil obtained from them, in pharmacology and in the production of cosmetics, soap, and candles [3, 7].

The use of pumpkin-containing drugs rich in omega-3, fatty acids (linoleic acid, palmitic acid, oleic acid, and stearic acid), zinc, and selenium is drawing attention worldwide [8,9,10]. β-carotene, which is reported to have an anti-aging effect, strengthen immunity, and help prevent the formation of tumors and cataracts, is abundant in pumpkin seeds [11, 12]. The unsaturated fatty acids in the seeds are reported to strengthen memory, help prevent cancer, and play an active role in reducing inflammation in the body [13, 14]. Pumpkin seed is also a rich source of protein, lutein, phenolic compounds, vitamins B1, B2, and C, α-tocopherol (vitamin E), and minerals (Mg, K, Fe, Na, Se, P, Zn, and Mn) [15,16,17].

The physical properties of agricultural products (such as shape, size, sphericity, surface area, bulk weight, moisture content, porosity, specific gravity, color, and mass) are important for consumer appeal and for post-harvest technologies [18,19,20]. Customers prefer products that look healthy and are regular in shape, color, and size [21]. As the moisture content of seeds increases, the breaking strength decreases, whereas the friction coefficient, porosity, and axial dimensions increase [22]. Size and shape data of seeds facilitate the design and manufacture of standard packages [23]. In addition, the shape and size characteristics of seeds are considered in the design of sorting and grinding machines [24,25,26]. The physical attributes of pumpkin seeds should be known for the design of equipment used from planting through post-harvest processing and marketing [27]. Measuring these attributes takes a lot of time and effort. To address this, novel technologies have been developed with which seeds can be identified, classified, and sorted easily and quickly. Such practical techniques, however, require a description of the features used in the quality assessment of seeds.

Artificial intelligence is an approach that imitates the human brain, transferring human characteristics to machines so that they can make decisions and complete tasks in new situations [28]. Machine learning is the performance of a specific task by computer systems through the acquisition and interpretation of extensive data, which makes it possible to categorize samples efficiently [29]. Machine learning uses multi-layered mathematical operations to learn from and manipulate complex data, and some of its models are built by mimicking the human brain [30]. Classification is carried out by processing data with machine learning algorithms, most commonly neural networks, decision trees, and support vector machines [26, 31].

Many studies have examined only the mass, size, and shape attributes of pumpkin seeds [18, 23, 32,33,34]. However, studies on the shape- and size-based classification of Cucurbitaceae seeds are limited. Existing classification studies generally concern pumpkin seed [35, 36, 53,54,55] and watermelon seed [37,38,39]. The literature review revealed no studies on the binary classification of pumpkin seeds using machine learning models. The novelty of this study lies in the binary classification of pumpkin seeds with similar physical attributes by machine learning and analytical methods. The aim of the study was to develop binary classification models with five machine learning algorithms (NB, SVM, RF, MLP, and kNN) for the classification of four different pumpkin seed varieties based on mass, shape, and size.

Materials and methods

Plant material and sample preparation

In this study, seeds of four pumpkin varieties (Develi, Sena Hanım, Türkmen, and Mertbey) were used as the plant material. The seeds were harvested on 16 September 2021 from Develi District (38° 16′ 25.7″ N, 35° 25′ 03.1″ E) in Kayseri province of Turkey. Deformed, dirty, and hollow seeds were removed before analysis, and the remaining seeds were stored at 4 ± 0.5 °C throughout the analysis.

Shape and dimension measurements

The mass of the seeds was measured with a precision electronic scale (± 0.001 g), and the principal dimensions, namely length (L, mm), width (W, mm), and thickness (T, mm), were measured with a digital caliper (± 0.01 mm). For mass, shape, and size, 100 seeds of each variety were measured. Size (geometric mean diameter, Dg, mm; volume, V, mm³; projected area, PA, mm²; and surface area, S, mm²) and shape (aspect ratio, AR; elongation, E; roundness, R; shape index, SI; and sphericity, φ, %) attributes were calculated using the equations given in Table 1. The flow chart of the binary classification of pumpkin seed varieties by machine learning is presented in Fig. 1. The stages consist of determining size, shape, and mass attributes, implementing feature selection, performing cross-validation, carrying out binary classification by machine learning, and evaluating performance metrics.

Table 1 Size and shape equations used in the calculations

Fig. 1 The flow chart of the binary classification of pumpkin seed varieties by machine learning
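Because the equations of Table 1 are not reproduced in the text, the following is only a minimal sketch of how three of the derived attributes could be computed from the caliper measurements. It assumes the definitions commonly used for seeds, Dg = (L·W·T)^(1/3), φ = (Dg/L) × 100, and S = π·Dg²; these may differ from the exact forms in Table 1, and the function names and the example seed dimensions are hypothetical.

```python
import math

def geometric_mean_diameter(length, width, thickness):
    """Dg (mm), assumed here to be the cube root of L * W * T."""
    return (length * width * thickness) ** (1.0 / 3.0)

def sphericity_percent(length, width, thickness):
    """Sphericity (%), assumed here to be Dg / L * 100."""
    return geometric_mean_diameter(length, width, thickness) / length * 100.0

def surface_area(length, width, thickness):
    """Surface area (mm^2), assumed here to be pi * Dg^2."""
    return math.pi * geometric_mean_diameter(length, width, thickness) ** 2

# Hypothetical seed dimensions in mm (not taken from the study data).
L, W, T = 20.0, 10.0, 3.0
dg = geometric_mean_diameter(L, W, T)
phi = sphericity_percent(L, W, T)
s = surface_area(L, W, T)
print(f"Dg = {dg:.2f} mm, sphericity = {phi:.1f} %, S = {s:.1f} mm^2")
```

Under these assumed definitions, a long, flat seed yields a sphericity well below 100%, which is in line with the values of about 40% reported for the four varieties.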

Multivariate analysis

Experimental data were evaluated with one-way analysis of variance, and Tukey’s multiple comparison test was used to separate significantly different means (p < 0.05). Linear discriminant analysis was used to evaluate differences between the varieties, and the variety group centroids from the discriminant analysis were used to create a scatter plot. The principal components (PCs) were evaluated in multivariate tests. Hotelling’s pair-wise comparisons with Bonferroni correction and squared Mahalanobis distances were used to determine whether the pumpkin seed varieties were similar to or different from one another. PAST v3.20 [40] and SPSS v20.0 [41] were used to conduct the statistical analyses.

Feature selection, validation methodology, and classification

Weka® v3.8 software was used to apply the machine learning classification models [42]. The five classifiers were run on a computer with 8 GB of memory and a Core i7 CPU running at 4.2 GHz. The classification of pumpkin seed varieties was based on the main physical attributes. Mass, length, geometric mean diameter, and shape index were used as the classification criteria because these attributes were selected by CFS (correlation-based feature selection) attribute selection. Each attribute was determined on 100 seeds per variety. The total number of data points was 5200, with 1300 for each variety. The k-fold cross-validation method was applied to evaluate model performance, with k = 10. The data set was divided into 10 sub-sets, and training was repeated over 10 iterations. In each iteration, one sub-set was used for testing and the other nine sub-sets were used for training, so that each sub-set was used exactly once for testing [43]. The k-fold cross-validation procedure is presented in Fig. 2.

Fig. 2 Ten-fold cross-validation methodology
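The splitting scheme described above can be sketched in a few lines. This is an illustrative, unstratified split written in plain Python, not the procedure Weka runs internally; the function name `k_fold_indices`, the random seed, and the 200-sample size (two varieties of 100 seeds, as in one binary pair) are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# One binary pair: 2 varieties x 100 seeds = 200 samples (assumed for the sketch).
splits = list(k_fold_indices(200, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```

With 200 samples and k = 10, every iteration trains on 180 samples and tests on the remaining 20.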

Machine learning approaches

Model development was performed on datasets (inputs) consisting of the selected physical attributes: mass, length, geometric mean diameter, and shape index. A total of 400 data points, 100 for each attribute, were used for each binary analysis. The models were created with the Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB), Multilayer Perceptron (MLP), and k-Nearest Neighbors (kNN) algorithms and validated by ten-fold cross-validation. In the kNN method, the Chebyshev distance was used with the LinearNNSearch search algorithm, and k was set to 5. The SVM used the Pearson VII universal kernel (PUK) function. The MLP had a 4-3-2 structure (neurons in the input, hidden, and output layers, respectively) in all binary classifications of the pumpkin seed varieties. The number of epochs, learning rate, and momentum coefficient were set to 500, 0.3, and 0.2, respectively, and the sigmoid activation function was used in all MLP classifications. The MLP model structure is given in Fig. 3, and detailed information about the ML models is provided in Fig. 4.

Fig. 3 MLP structure model
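To make the 4-3-2 structure concrete, the sketch below runs a single forward pass through such a network with sigmoid activations. It is not the trained Weka model: the weights are random placeholders, the helper names (`init_layer`, `dense`, `forward`) are hypothetical, the input vector is an arbitrary set of four normalized attribute values, and the use of a sigmoid in the output layer is an assumption.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """Random placeholder weights and biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases

def dense(inputs, weights, biases):
    """Weighted sum plus bias for each neuron, followed by the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

rng = random.Random(42)
w_hidden, b_hidden = init_layer(4, 3, rng)   # 4 inputs -> 3 hidden neurons
w_out, b_out = init_layer(3, 2, rng)         # 3 hidden -> 2 outputs (one per variety)

def forward(features):
    return dense(dense(features, w_hidden, b_hidden), w_out, b_out)

# Trainable parameters: (4*3 weights + 3 biases) + (3*2 weights + 2 biases).
n_parameters = (4 * 3 + 3) + (3 * 2 + 2)

# Arbitrary normalized values for mass, length, Dg, and shape index.
scores = forward([0.5, 0.6, 0.4, 0.3])
print(n_parameters, scores)
```

The network is small: it has only 23 trainable parameters, and the predicted variety would be the one whose output neuron gives the larger score.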

Fig. 4 Detailed information on ML models
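The kNN setting described above (Chebyshev distance, exhaustive linear search, k = 5) can be illustrated with a short plain-Python sketch. It is a simplified stand-in for the Weka implementation; the function names and the toy two-feature data are hypothetical, and the features are assumed to be already scaled to a common range.

```python
from collections import Counter

def chebyshev(a, b):
    """Chebyshev distance: the largest absolute difference over all features."""
    return max(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k=5):
    """Linear search over all training samples, then majority vote of the k nearest."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: chebyshev(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy, already normalized two-feature samples for two hypothetical classes.
train_x = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.3), (0.2, 0.2), (0.3, 0.1),
           (0.8, 0.9), (0.9, 0.8), (0.7, 0.9), (0.9, 0.9), (0.8, 0.7)]
train_y = ["A"] * 5 + ["B"] * 5

print(knn_predict(train_x, train_y, (0.15, 0.15)))
print(knn_predict(train_x, train_y, (0.85, 0.85)))
```

Because the Chebyshev distance is governed by the single largest feature difference, attributes on very different scales (for example mass in g and length in mm) would need to be normalized before the distance is meaningful.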

The outcomes include accuracies and confusion matrices for each pair of the four pumpkin seed varieties, together with the F-measure (F), precision (P), ROC (receiver operating characteristic) area, and PRC (precision–recall curve) area. Accuracy (Ac), F-measure, and precision were determined by Eqs. (1), (2), and (3), respectively [44].

$$A_{\text{c}} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}} \times 100 \tag{1}$$

$$F = \frac{2 \times P \times S_{\text{e}}}{P + S_{\text{e}}} \tag{2}$$

$$P = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{3}$$

TN = number of true negatives, TP = number of true positives, FN = number of false negatives, FP = number of false positives, and Se = sensitivity (the TP rate, TP/(TP + FN)).
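As a worked illustration of Eqs. (1) to (3), the sketch below computes the three indices from the four cells of a binary confusion matrix. The counts are illustrative: they were chosen to match TP rates of 0.790 and 0.910 with 100 seeds per class, as reported later for the Develi and Mertbey pair with MLP, and the function name is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return accuracy (%), precision, sensitivity, and F-measure per Eqs. (1)-(3)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100.0
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # Se, the TP rate
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, f_measure

# Illustrative counts: 79 of 100 positive-class seeds and 91 of 100
# negative-class seeds classified correctly.
acc, p, se, f = binary_metrics(tp=79, fn=21, fp=9, tn=91)
print(f"Ac = {acc:.2f} %, P = {p:.3f}, Se = {se:.3f}, F = {f:.3f}")
```

With these counts the indices come out as Ac = 85.00%, P = 0.898, Se = 0.790, and F = 0.840, which agrees with the values reported for Develi in that pair.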

To compare the results of different classifiers using statistical metrics, the information provided by the ROC curve must be condensed into a single response variable [43, 45]. The area under the entire ROC curve was proposed as a suitable performance metric because it lies between 0 and 1 and facilitates comparisons between classifiers [46]. An ROC area close to 1 indicates that most positive-class samples have been given higher scores than the negative-class samples, so that a threshold separating the two classes almost perfectly exists.
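The interpretation of the ROC area as a ranking measure can be sketched directly: the area equals the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. The function name and the scores below are hypothetical and serve only to illustrate this equivalence; this is not how Weka reports the value.

```python
def roc_area(positive_scores, negative_scores):
    """ROC area as the share of positive/negative pairs ranked correctly."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical classifier scores for four positive and four negative samples.
positives = [0.9, 0.8, 0.6, 0.4]
negatives = [0.7, 0.5, 0.3, 0.2]

print(roc_area(positives, negatives))            # 13 of 16 pairs ranked correctly
print(roc_area([0.9, 0.8], [0.2, 0.1]))          # perfectly separated scores
```

In the first case 13 of the 16 pairs are ordered correctly, giving an area of 0.8125; when every positive score exceeds every negative score, the area is 1 and a perfectly separating threshold exists.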

Results and discussion

Seed physical attributes

Size, shape, and mass attributes of the four pumpkin seed varieties were determined, and binary classification was applied to the varieties. The physical attributes are given in Tables 2 and 3. In this study, differences among varieties were significant for all physical attributes (p < 0.01). The Sena Hanım variety had the greatest mass (0.23 g), while Türkmen had the lowest (0.18 g). The highest volume and length values were obtained from the Sena Hanım (V: 363.61 mm³ and L: 21.43 mm) and Mertbey (V: 357.44 mm³ and L: 22.08 mm) varieties. The greatest and the lowest thickness values were 3.03 and 2.83 mm, from the Mertbey and Develi varieties, respectively. The greatest geometric mean diameter, projected area, and surface area values were obtained from Sena Hanım (Dg: 8.79 mm and SA: 244.63 mm²) and Mertbey (Dg: 8.74 mm and SA: 241.72 mm²).

Table 2 Size and mass attributes for pumpkin seed varieties
Table 3 Shape attributes for pumpkin seed varieties

The Develi variety had the highest sphericity (42.28%) and roundness (0.19). Roundness values close to 1 indicate an almost spherical seed form. The lowest sphericity and roundness were obtained from Mertbey, with values of 39.73% and 0.16, respectively. The greatest shape index was obtained from the Mertbey variety (3.40), while the lowest was obtained from the Develi variety (2.99). All varieties were defined as oval, as the shape index was above 1.25. The Türkmen variety had the highest aspect ratio (0.15). The greatest elongation values were found for Mertbey, Sena Hanım, and Develi, with values of 7.35, 7.31, and 7.14, respectively, and Türkmen had the lowest elongation (6.72). Shape index values increased with decreasing sphericity and roundness; Çetin et al. [56] obtained similar results.

Consistent with the present results, Devi et al. [33] reported mean length, width, and thickness values of pumpkin seeds as 16.81, 8.87, and 2.75 mm, respectively, and geometric mean diameter and single seed weight as 7.42 mm and 0.20 g, respectively. Khodabakhshian et al. [32] investigated the main shape and size attributes of pumpkin seeds of the Zaria and Gaboor varieties at different moisture contents (4%, 8%, 14%, and 20%). The authors reported that length, width, thickness, diameter, and sphericity ranged between 14.90 and 17.55 mm, 6.91 and 8.93 mm, 3.05 and 4.95 mm, 7.18 and 9.45 mm, and 0.54 and 0.53 for the Zaria variety, and between 15.86 and 18.96 mm, 5.17 and 7.94 mm, 2.92 and 4.69 mm, 6.38 and 9.11 mm, and 0.45 and 0.53 for the Gaboor variety, respectively. In contrast to the present findings, Priyadarshini et al. [34] measured seed length, width, thickness, elongation (L/T ratio), and single seed weight of 12 different cucumber genotypes and reported grand mean values of 11.10 mm, 4.60 mm, 2.52 mm, 4.36, and 0.28 g, respectively. These differences may be due to the difference in species. Paksoy and Aydin [23] found the length, width, thickness, geometric mean diameter, sphericity, volume, and mass of pumpkin seeds to be 18.16 mm, 9.80 mm, 2.67 mm, 7.72 mm, 43.0%, 0.73 cm³, and 0.29 g, respectively. Altuntaş et al. [18] reported pumpkin seed length, width, thickness, geometric mean diameter, sphericity, surface area, single seed volume, and unit seed mass as 19.92 mm, 11.30 mm, 3.22 mm, 9.71 mm, 60.55%, 2.54 cm², 0.11 cm³, and 0.21 g, respectively. Differences among studies are primarily attributed to varieties, climate conditions, and moisture contents [47].

Discrimination analysis

The results of the linear discriminant analysis for the physical attributes of the pumpkin seed varieties are shown in Table 4. The higher the eigenvalue of a function, the more of the variation in the dependent variable the function describes. In the study, the eigenvalues were 0.490, 0.381, and 0.042 for functions 1, 2, and 3, respectively. The effect size of each function is given by the square of its canonical correlation. The first two functions explained 95.4% of the total variation (53.7% and 41.7%, respectively). The quality of the estimation is assessed by Wilks’ lambda, which was significant for all three functions in the current investigation. According to the Wilks’ lambda statistic, the unexplained portion of the differences between the groups was 46.6%. The relative importance of the eight estimators was determined by the discriminant function coefficients. The chi-square value was 299.742 for functions 1–3. According to the loadings, geometric mean diameter and length had the highest loadings in function 1, whereas shape index and sphericity had the highest loadings in function 2.

Table 4 Discriminant analysis results
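The percentages quoted above follow directly from the three eigenvalues. The short check below is a sketch based on two standard relations of discriminant analysis: the share of variation explained by a function is its eigenvalue divided by the sum of eigenvalues, and the overall Wilks’ lambda is the product of 1/(1 + eigenvalue) over all functions. The variable names are arbitrary.

```python
# Eigenvalues of discriminant functions 1, 2, and 3 as reported in the text.
eigenvalues = [0.490, 0.381, 0.042]

# Share of the total variation explained by each function (%).
total = sum(eigenvalues)
explained = [ev / total * 100.0 for ev in eigenvalues]

# Overall Wilks' lambda: product of 1 / (1 + eigenvalue) over all functions.
wilks_lambda = 1.0
for ev in eigenvalues:
    wilks_lambda *= 1.0 / (1.0 + ev)

print([round(e, 1) for e in explained])
print(round(explained[0] + explained[1], 1))
print(round(wilks_lambda, 3))
```

The first two functions account for 53.7% and 41.7% of the variation (95.4% together), and the resulting Wilks’ lambda of 0.466 corresponds to the 46.6% unexplained portion stated in the text.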

Group centroids of the four varieties based on their canonical discriminant functions are displayed in Fig. 5. Geometric mean diameter, length, shape index, and sphericity were considered the significant discriminating attributes. Length and geometric mean diameter separated the Sena Hanım and Türkmen varieties along canonical function 1, while sphericity, shape index, and roundness determined the positions of the Develi and Mertbey varieties along canonical function 2.

Fig. 5 Scatter plot of the pumpkin seed varieties based on the discriminant scores and group centroids for functions 1 and 2 (Develi: dark blue, Sena H.: cadet blue, Türkmen: crimson, Mertbey: golden; the variety names in the figure mark the group centroids)

Pair-wise comparison and multivariate tests

Hotelling’s trace, Pillai’s trace, and Wilks’ Lambda statistics revealed that the varieties differed significantly in their physical attributes (p < 0.01). Table 5 provides the MANOVA results, Bonferroni-corrected p values, and Mahalanobis distances. Wilks’ Lambda represents the proportion of variance in the dependent variables that is not explained by differences in the independent variable; it ranges from 0 to 1, and smaller values indicate larger differences between the groups. Pillai’s trace, which considers the sum of the variance in the dependent variables explained by the discrimination of the independent factor, is regarded as the most reliable of the multivariate statistics. In the study, Pillai’s trace, Wilks’ Lambda, and Hotelling’s trace values were 0.752, 0.405, and 1.105, respectively. Cetin et al. [20] found that varieties with a Mahalanobis distance of less than 3 exhibit remarkably similar characteristics. The Develi and Sena Hanım varieties, which had the smallest Mahalanobis distance, shared similar characteristics. The greatest distance was found between the Sena Hanım and Türkmen varieties, indicating that these varieties had different characteristics. The Bonferroni-corrected p values supported these findings.

Table 5 Differences among the pumpkin seed varieties

Performance results of binary classification

Binary classification of pumpkin seeds was performed for each pair of varieties. Five machine learning techniques (RF, SVM, NB, MLP, and kNN) were used to generate classification models from the size, shape, area, and mass attributes for each pair. For the Develi and Sena Hanım pair, all five classifiers achieved only fairly adequate classification accuracies (Table 6). MLP gave the highest accuracy (73.00%), while RF had the lowest (70.00%). These findings were also supported by the other performance metrics. TP rate, precision, F-measure, and PRC area were 0.740 and 0.690, 0.705 and 0.726, 0.722 and 0.708, and 0.652 and 0.656 for Develi and Sena H., respectively, and the ROC area was 0.715. For the Develi and Türkmen pair, the greatest accuracy was obtained with the MLP algorithm (72.00%), and the kNN algorithm had the lowest accuracy (65.50%). For this pair, the classification accuracies of both classifiers were slightly lower yet still acceptable. The Develi and Mertbey pair had the greatest classification accuracies among the variety pairs involving Develi, and the MLP algorithm had the greatest accuracy (85.00%). With MLP, the TP rate reached 0.790 for Develi and 0.910 for Mertbey, precision 0.898 for Develi and 0.813 for Mertbey, F-measure 0.840 for Develi and 0.858 for Mertbey, PRC area 0.894 for Develi and 0.896 for Mertbey, and ROC area 0.907 for both varieties (Table 6). MLP was followed by SVM and RF, with accuracies of 84.50% and 83.50%, respectively.

Table 6 Performance metrics and confusion matrices for Develi variety

Seeds of the Sena Hanım and Türkmen varieties were discriminated by the five algorithms with accuracies between 77.50% and 84.50%. MLP was the most successful algorithm, classifying the seeds with 84.50% accuracy according to the confusion matrices (Table 7). The lowest accuracy (77.50%) was found with the RF algorithm. The ROC area for the Sena Hanım and Türkmen pair was 0.904. In terms of classification performance, the next pair was Sena Hanım and Mertbey, for which the highest accuracy was observed with MLP (74.50%). With the MLP algorithm, TP rate, precision, F-measure, PRC area, and ROC area were 0.750, 0.743, 0.746, 0.803, and 0.776 for Sena Hanım, and 0.740, 0.747, 0.744, 0.803, and 0.804 for Mertbey, respectively (Table 7).

Table 7 Performance metrics and confusion matrices for Sena H. variety

The Türkmen and Mertbey pair was found to have a classification accuracy of more than 87.00%. The SVM model yielded an accuracy of 82.50% in the binary classification. For this model, the performance metrics for Türkmen and Mertbey were 0.810 and 0.840 (TP rate), 0.835 and 0.816 (precision), 0.822 and 0.828 (F-measure), 0.825 (ROC area), and 0.771 and 0.765 (PRC area), respectively (Table 8).

Table 8 Performance metrics and confusion matrices for the Türkmen and Mertbey varieties

The ROC curves of each pumpkin seed variety pair are shown for the models created using the size and shape attributes (Fig. 6). The ROC curve is a graphical representation of the efficacy of a predictive model, and the curves showed that the classifiers discriminated the varieties correctly. As expected, the MLP and SVM algorithms produced the largest ROC area values. These large ROC area values indicate very good performance for the automatic identification of the varieties under study. As seen from the ROC curves in Fig. 6, the best classified pumpkin seed variety pairs were Develi vs Mertbey, Türkmen vs Mertbey, and Sena Hanım vs Türkmen, whereas the worst classified pairs were Develi vs Sena Hanım, Develi vs Türkmen, and Sena Hanım vs Mertbey.

Fig. 6 ROC curves of classified pairs based on selected physical attributes

MLP and RF showed an excellent ability to discriminate among the varieties by maximizing the distance between groups and minimizing the distance within groups. The MLP accuracy values for these varieties were very promising, and the SVM algorithm also stood out in this study. Pumpkin seeds of different varieties are very similar to each other in structure and physical attributes. For this reason, the medium-to-high accuracy values obtained here are encouraging for future studies. Related studies with similar or differing findings are discussed below.

Similarly, Demir et al. [35] used a Radial Basis Neural Network (RBNN) and a Back Propagation Neural Network (BPNN) to predict the physical attributes of pumpkin seeds and reported RMSE values of 0.0025 and 0.6875 for RBNN and BPNN, respectively. The authors also noted the superior prediction of the RBNN structure and stated that these algorithms could be an alternative to traditional methods. Koklu et al. [36] determined the physical attributes of two pumpkin seed varieties, “Ürgüp Sivrisi” and “Çerçevelik”, and classified them using LR, MLP, SVM, RF, and k-NN, with model accuracies of 87.92, 88.92, 88.64, 87.56, and 87.64, respectively. These results are higher than the present findings because of the structure of the selected varieties: the “Ürgüp Sivrisi” variety has a more oval shape, while the “Çerçevelik” variety has a round shape. Li et al. [53] classified pumpkin seeds using a convolutional neural network and hyperspectral imaging technology. The authors indicated that PA-3DCNN had the highest accuracy among the algorithms, with values of 99.14% and 95.24% for the training and test sets, respectively, and that accuracies ranged between 65.18% and 99.14% for eight different models. Prasad et al. [54] designed and implemented machine learning models, including LR, SVM, DT, NB, ANN, and kNN, for the classification of pumpkin seed varieties and obtained average accuracies of 99.81%, 52.20%, 100.00%, 52.00%, 95.80%, and 77.20%, respectively. They reported that DT had the best results and could be used effectively for the classification of pumpkin seeds. Gulzar et al. [55] proposed a system for classifying 14 different seeds (sunflower, onion, mustard, kidney beans, flax, fenugreek, black eyed peas, black pepper, chickpea, coriander, corn, cumin, fennel, and pumpkin) using machine learning and deep learning. Classification accuracy reached 99% for the test set. These results are quite high because the seeds belong to clearly distinct types; nevertheless, the lowest results were obtained for pumpkin seeds.

Liu et al. [37] applied LS-SVM, BPNN, and RF algorithms to discriminate watermelon seeds. Using spectral + morphology features, the LS-SVM, BPNN, and RF results were 92%, 84%, and 87% for the Julong variety and 83%, 75%, and 91% for the Xiali variety, respectively. Mukasa et al. [39] discriminated triploid watermelon seeds from diploid and tetraploid seeds. The authors created one-class classification models with ML techniques using SVM quadratic and DD-SIMCA models, which yielded triploid accuracies of 84.3% and 69.5%, respectively. Ahmed et al. [38] evaluated deep learning and conventional machine learning methods for the classification of watermelon seeds by morphological patterns and reported accuracies of 87.3% and 83.6% for the ResNet-50 and LDA algorithms, respectively. These findings show that classification based on physical attributes can be performed using machine learning algorithms. The attributes and algorithms studied here have given successful results in many similar studies.

Conclusion

Machine learning was shown to be effective in discriminating pumpkin seeds on the basis of physical characteristics. For the classification models, the data were prepared through a series of preprocessing steps, and datasets and models were then created with the selected features (mass, length, geometric mean diameter, and shape index). With these datasets, MLP and SVM were the most successful machine learning algorithms. The variety pair with the highest accuracy was Develi and Mertbey, while the pair with the lowest accuracy was Develi and Türkmen. The practical importance of the study is the correct and rapid classification of seeds with very similar characteristics using machine learning. In addition, accurately classifying pumpkin seeds that meet specific criteria is crucial for the food and agricultural industries. Based on the present findings, the approach could be suggested as a valuable control tool in the development of planters in agricultural machinery, in breeding research, and in the seed industry. The study had some limitations that suggest directions for future research. One limitation was the time-consuming process of measuring shape, size, and mass attributes. To overcome this, we recommend using modern techniques such as image processing with affordable yet effective hardware, such as webcams, action cameras, or mobile phone cameras. Furthermore, future studies can be expanded by incorporating more data sets, attributes, and algorithms.