Introduction

Sunflower is native to North America and has long attracted the attention of scientists as an important industrial crop (Badouin et al. 2017). In evolutionary biology, the genus Helianthus is a long-standing model of hybrid traits (Rieseberg et al. 1995) and adaptive introgression (Vandenbrink et al. 2014). In plant science, it is also regarded as a model for studying sun tracking and the development of flowering (Tähtiharju et al. 2011). Sunflower seeds are rich in oil (40–50%) (Demir and Basalma 2018) and oleic acid (30%) (Avni et al. 2016). They also have high contents of unsaturated and polyunsaturated fatty acids and a high linoleic acid content (Kiriamiti et al. 2002; Salgin et al. 2006). The fact that sunflower was responsible for about 87% of vegetable oil production makes it more prominent than the other oilseeds. Sunflower seeds are also rich in minerals, vitamins, flavonoids, glycosides, phenolic acids, alkaloids, tannins, caffeic acid and sterols (Adeleke and Babalola 2020). The seeds constitute an important source of protein and offer a balanced source of amino acids (Kiriamiti et al. 2002; Salgin et al. 2006). Therefore, they play an important role in human nutrition and health. In general, 90% of sunflower seeds are used for oil production and the rest are consumed fresh as an appetizer (Khodabakhshian et al. 2009).

Agricultural products are distinguished by many different engineering traits. Shape and size properties (dimensions, area, mass, elongation and sphericity parameters) are important quality indicators of agricultural products. These features should be well known when designing machinery and systems used in processes such as classification and quality assessment of agricultural products (Costa et al. 2011). Likewise, the physical properties of sunflower seeds should be well known when designing equipment such as cleaners, graders, dehullers, separators, and oil expellers (Gupta et al. 2007).

Size, mass and shape attributes can be determined for a single seed. However, the frequency distributions of these properties should be defined for whole sets of seeds when designing the relevant equipment (Khazaei et al. 2008). In order to minimize the number of damaged and defective seeds, reduce the power consumption of threshing machines and increase machine efficiency, the relevant machines and classifiers should be designed and optimized to match the physical properties of the seeds (Mirzabe et al. 2012). Shape, size, mass and biochemical attributes determine the quality of seeds for various purposes (Sorkheh et al. 2018). Product quality is characterized as a combination of the texture, flavor and visual appearance of the product, and consumers often shape their preferences based on these parameters (Kays 1999).

Machine learning-based product classification applications are effective tools for designing accurate and reliable classifiers. Such applications include various algorithms such as decision trees, artificial neural networks, genetic algorithms, regression and fuzzy logic. Additionally, there are powerful algorithms for training machine learning models, for learning difficult input-output mappings, and for selecting useful features and removing redundant ones. These models are generally used for the correct selection of descriptive features in the quality assessment of agricultural products (Omid et al. 2010; Mollazade et al. 2012).

Several researchers have studied the size and shape attributes (Gupta and Das 1997; Santalla and Mascheroni 2003; Khodabakhshian et al. 2010; Mirzabe et al. 2012; Jafari et al. 2011; Malik and Saini 2016; Munder et al. 2017; Ortiz-Hernandez et al. 2020), mineral composition (Ingale and Shrivastava 2011; Kirbaslar et al. 2012), and fatty acid composition of sunflower seeds (Gupta and Shrivastava 2004; Ekin et al. 2005; Izquierdo and Aguirrezábal 2008; Werteker et al. 2010; Ingale and Shrivastava 2011; Alberio et al. 2016; Li et al. 2017). Other researchers have used different machine learning algorithms to assess quality traits of seeds and other agricultural products (Kavdır and Guyer 2008; Omid 2011; Mollazade et al. 2012; Zielinska et al. 2012; Barbosa et al. 2015; Soltani and Omid 2015; Siedliska et al. 2017; Niazian et al. 2018; Abdipour et al. 2019; Gumus et al. 2018; Kurtulmuş 2020). However, the literature review revealed no studies on the discrimination and classification of sunflower seeds based on physical and chemical properties with the use of machine learning algorithms. Therefore, this study was conducted to (i) determine the size, shape, and mass attributes and the fatty acid and mineral composition of sunflower seeds; (ii) distinguish the sunflower oilseed varieties; and (iii) classify and compare the samples by using different machine learning algorithms.

Materials and Methods

Plant Material and Sample Preparation

Seeds of six sunflower cultivars (Transol, Colombi, Tunca, 63MM54, LG5582, and P64LC53) were used as the plant material of the present study. Sunflower seeds were harvested in October 2018 from a local farm located about 48 km from Konya, Turkey (38° 18′ 54.5″ N 32° 46′ 35.6″ E). Foreign materials such as dust, dirt and bark, as well as deformed and immature seeds, were manually removed from the raw material. The seeds were stored in a refrigerator at 5 ± 0.5 °C until the time of analysis. Before the analyses, a sufficient quantity of seeds was removed from the refrigerator and kept under room conditions for 2 hours. Seed mass was determined with a precision balance (±0.001).

Determination of Proximate Composition

The methods specified in AOAC (2000) were used to determine crude ash and oil content of the sunflower seeds. Samples were dried in an oven at 105 °C for 4 hours to get dry matter content. All proximate analyses were performed in three replications.

Determination of Fatty Acid Composition

Fatty acid composition of the sunflower seeds was determined from fatty acid methyl esters (Sukhija and Palmquist 1988) in a gas chromatography device equipped with a flame ionization detector and a capillary column. Helium (He) was used as the carrier gas. The resultant peaks were compared to authentic standards to quantify the fatty acids.

Determination of Element Composition

Initially, 0.5 g ground sample was supplemented with 10 ml nitric + perchloric acid. Following the wet-digestion of the resultant mixture, diluted samples were subjected to element (B, Ca, Cd, Cu, Fe, K, Mg, Mn, Na, Ni, P, S, and Zn) analyses in an ICP-OES device (Mertens 2005a, 2005b).

Shape and Dimensions

A digital caliper (±0.01 mm) was used to measure seed length (L, mm), width (W, mm) and thickness (T, mm). Then, the volume (mm³), sphericity (φ, %), shape index (SI), roundness (R), elongation (E), aspect ratio (AR), geometric mean diameter (Dg, mm), projected area (PA, mm²) and surface area (S, mm²) of the samples were calculated with the equations given in Table 1.

Table 1 Equations used to calculate the size and shape properties of the sunflower oilseed varieties
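The equations of Table 1 are not reproduced in this text. As a rough illustration of how such descriptors follow from the three caliper measurements, the sketch below uses relations that are common in the seed-property literature. These relations, the function name and the example dimensions are assumptions and may differ from the definitions in Table 1.

```python
import math

def shape_descriptors(length, width, thickness):
    """Return common size and shape descriptors for one seed (inputs in mm).

    Assumed textbook relations; they may differ from those in Table 1.
    """
    dg = (length * width * thickness) ** (1.0 / 3.0)  # geometric mean diameter, mm
    return {
        "Dg": dg,
        "sphericity_pct": dg / length * 100.0,  # sphericity as a percentage
        "surface_area": math.pi * dg ** 2,      # mm^2, sphere of diameter Dg
        "aspect_ratio": width / length,
        "elongation": length / width,
    }

# Hypothetical seed dimensions, for illustration only.
seed = shape_descriptors(length=11.0, width=5.5, thickness=3.5)
```

Under these assumed definitions, aspect ratio and elongation are reciprocals of each other, so one rises as the other falls.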

Feature Selection and Validation Methodology

Many different techniques are available for feature selection. In this study, correlation-based feature selection (CFS) was used. CFS combines a search algorithm with Pearson correlations to assess feature subsets. It weighs the ability of individual features to predict the class label against the level of intercorrelation among them (Hall 1999). The best-first search algorithm was used to select the best subset (Witten and Frank 2005). The CFS (CfsSubsetEval) algorithm was applied to the physical properties of sunflower seeds with the use of Weka® v3.8 software (Hall et al. 2009), and 7 features (length, width, thickness, mass, projected area, sphericity, and aspect ratio) were selected. In the study, 105 seeds of each of the six varieties (630 seeds in total) were measured for each feature. A total of 4410 values (630 seeds × 7 features) were therefore used for classification after CFS attribute selection.
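The trade-off that CFS evaluates can be made concrete with the subset merit of Hall (1999), which rewards correlation with the class and penalizes correlation among the features. The sketch below is a minimal illustration, not the Weka implementation; the function name and the correlation values are hypothetical.

```python
import math

def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset from mean absolute correlations (Hall 1999).

    feature_class_corr: |r| between each feature in the subset and the class.
    feature_feature_corr: |r| for every pair of features in the subset.
    """
    k = len(feature_class_corr)
    r_cf = sum(feature_class_corr) / k
    r_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
            if feature_feature_corr else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical correlations, for illustration only.
redundant = cfs_merit([0.6, 0.6], [0.9])      # two strongly intercorrelated features
complementary = cfs_merit([0.6, 0.6], [0.1])  # two nearly independent features
```

With equal feature-class correlations, the subset whose features are less correlated with each other receives the higher merit, which is why a search over subsets tends to drop redundant attributes.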

Cross-validation evaluates the generalization ability of a model by measuring its performance on data not used during training. This method is applied practically and effectively in classification and prediction (Stegmayer et al. 2013). In the present study, the k-fold cross-validation method was used to validate the classification models. In machine learning applications, the k value is usually chosen as 5 or 10 (Ataş et al. 2012); it was set to 10 in the present study. The dataset was divided into 10 subsets in such a way that every subset had an approximately equal proportion of examples from each class. Training and testing were carried out in 10 iterations. In each iteration, 1 subset was used for testing and the remaining 9 subsets were used for training, so that each of the k subsets was used exactly once for testing. The average of the errors over all iterations was then taken as the error estimate of the classifier. Evaluation metrics were calculated for each algorithm from its confusion matrix, and the performance of the classification algorithms was compared using these metrics (Witten and Frank 2005; Stegmayer et al. 2013).
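The fold construction described above can be sketched as follows. This is an illustrative stand-alone version, not the procedure implemented in Weka; the function name, the random seed and the integer class labels are assumptions, while the 6 × 105 layout follows the sample sizes given in the text.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, balancing classes across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)  # rotate the start so leftovers spread evenly
    return folds

# Six varieties with 105 seeds each, encoded here as classes 0..5.
labels = [variety for variety in range(6) for _ in range(105)]
folds = stratified_folds(labels, k=10)
for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be fitted on `train` and evaluated on `test_fold`;
    # the 10 error estimates are then averaged.
```

Because 105 is not divisible by 10, each fold holds 10 or 11 seeds of a given variety, which is why the class proportions are only approximately equal.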

Classification Techniques

Machine learning classification was carried out in the present study with the Weka® v3.8 software (Hall et al. 2009). Six different machine learning classifiers were run on a laptop with a 2.5 GHz Core i5 central processing unit (CPU) and 8 GB of memory. The classification of varieties was based on the main physical properties. Shape (sphericity, aspect ratio), size (length, width, thickness, projected area) and mass were used as the classification criteria because these properties were chosen by CFS attribute selection.

Decision Tree (DT)

Decision tree (DT), a supervised learning algorithm, learns from specific data and develops a set of rules to classify or predict an attribute (Drazin and Montag 2012). The method depends on a “divide and conquer” strategy (Safavian and Landgrebe 1991). DTs are arranged so that a class is rejected at each layer of the tree. The last class remaining at the bottom of the tree is taken as the assigned class. Two DT induction algorithms are commonly used in classification, namely J48 (the Weka implementation of C4.5) and the logistic model tree (LMT) (Quinlan 1993; Landwehr et al. 2005). In this study, the J48 (C4.5) decision tree was used; the number of leaves was 28 and the size of the tree was 55. A schematic illustration of the applied DT is presented in Fig. 1.
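C4.5 selects the test at each node by the gain ratio, that is, the information gain of a split divided by the entropy of the split itself. The sketch below evaluates a binary threshold split on one numeric attribute. It is a simplified illustration rather than the J48 code, and the masses and variety labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` on one numeric attribute."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    info_gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return info_gain / split_info

# Hypothetical seed masses (g) for two varieties, for illustration only.
mass = [0.05, 0.06, 0.07, 0.10, 0.11, 0.12]
variety = ["Tunca", "Tunca", "Tunca", "Colombi", "Colombi", "Colombi"]
best = gain_ratio(mass, variety, 0.08)
```

If every internal node splits in two, a tree with 28 leaves has 27 internal nodes and therefore 55 nodes in total, which agrees with the reported tree size.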

Fig. 1 Structure models of the decision tree (a) and multilayer perceptron (b) algorithms

Random Forest (RF)

The random forest algorithm builds a set of decision trees, each trained on a different sample of the data. Since each tree is a separate classifier in RF, this algorithm is an ensemble classification method (Breiman 2001). In contrast to DT, the class of a sample is assigned by the majority vote of the ensemble of trees constructed by RF (Berhane et al. 2018). Since the bootstrap and ensemble scheme can overcome the overfitting problem inherited from DT, there is no pruning step in RF (Ali et al. 2012). In addition, RF generally has a high predictive accuracy and is robust against noise (Breiman 2001; Rodriguez-Galiano et al. 2012).
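The two ingredients named above, bootstrap sampling and majority voting, can be sketched in a few lines. The helper names and the example votes are hypothetical, and tree training itself is omitted.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the sample used to grow one tree)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)            # indices of training rows
votes = ["Tunca", "Colombi", "Tunca", "Transol", "Tunca"]  # hypothetical tree outputs
winner = majority_vote(votes)
```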

Support Vector Machines (SVMs)

SVM comprises different learning algorithms used in classification and regression models. In the SVM method, the optimal decision boundary (hyperplane) is obtained from the distribution of the training vectors in the data set. The vectors located closest to the hyperplane are defined as support vectors. If the classes cannot be separated linearly in the input space, a kernel function is used to examine the relationships between the investigated traits (Vapnik 2000).

Multiple Linear Regression (MLR)

MLR elucidates the linear relationships of a dependent variable with the independent variables of interest. MLR, which is one of the prediction methods, was applied to the data sets in the study. With the prediction techniques used in the present analyses (Witten and Frank 2005; Wu et al. 2008), a response variable (y) was estimated from a range of predictor variables with the use of Eq. 1 (Vereecken and Herbst 2004; Hastie et al. 2009):

$$ y=a+\sum \limits_{i=1}^n{b}_i{x}_i+\varepsilon $$
(1)

where a is an intercept, xi is a predictor variable, bi is a regression coefficient, and ε represents the error.

Naïve Bayes (NB)

This algorithm computes the prior probability of each class and the conditional probability of each attribute value within each class. The attributes are assumed to be independent of each other given the class. Classification is done using the known probabilities of each class and the known probabilities of each attribute. The posterior probability of class Ci is proportional to (Eq. 2):

$$ P\left({C}_i\right)\times P\left({a}_1={x}_1\ \mathrm{and}\ {a}_2={x}_2\ \dots\ \mathrm{and}\ {a}_n={x}_n\mid {C}_i\right) $$
(2)

Assuming that the attributes are independent, the value of this expression can be reached by the following Eq. 3:

$$ P\left({C}_i\right)\times P\left({a}_1={x}_1\mid {C}_i\right)\times P\left({a}_2={x}_2\mid {C}_i\right)\times \dots \times P\left({a}_n={x}_n\mid {C}_i\right) $$
(3)

Thus, this product is calculated for each class Ci (i = 1, 2, …, k, where k is the number of classes) and the class with the highest value is selected (Bramer 2007).
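The product in Eq. 3 can be computed directly from frequency counts. The sketch below assumes categorical attributes and applies no smoothing, which keeps it close to the equations. The attribute values, class names and function names are hypothetical, and the continuous seed measurements of this study would first need to be discretized or modeled with a density.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def classify_nb(row, class_counts, value_counts):
    """Return the class maximizing P(C_i) * prod_j P(a_j = x_j | C_i), plus all scores."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # P(C_i)
        for j, value in enumerate(row):
            score *= value_counts[(label, j)][value] / count  # P(a_j = x_j | C_i)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical discretized attributes (size, shape), for illustration only.
rows = [("small", "round"), ("small", "long"), ("small", "round"),
        ("large", "long"), ("large", "long"), ("large", "round")]
labels = ["A", "A", "A", "B", "B", "B"]
class_counts, value_counts = train_nb(rows, labels)
predicted, scores = classify_nb(("small", "round"), class_counts, value_counts)
```

For the query seed, class A scores 0.5 × 3/3 × 2/3 = 1/3 and class B scores 0.5 × 0/3 × 1/3 = 0, so class A is selected.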

Multilayer Perceptron (MLP)

ANN models are composed of interconnected processing elements similar to biological neurons and weighted connections similar to synapses in the brain (Karray and Silva 2004). The multilayer perceptron (MLP) is a commonly used ANN for classification purposes (Omid et al. 2010). MLPs consist of input layers, hidden layers, and output layers. MLP is a feed-forward neural network (FFNN); that is, data flow in a single direction from the input layers toward the output layers. In classification, the MLP model is trained with the back propagation algorithm (Karray and Silva 2004). MLP is commonly used in the classification of agricultural products (Marini et al. 2004; Mollazade et al. 2012). In this study, a 7-6-6 MLP structure with a learning rate of 0.3, a momentum of 0.2, 500 epochs and a sigmoid activation function was used for classification (Fig. 1).
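A forward pass through a 7-6-6 network with sigmoid units can be sketched as follows. The weights here are random placeholders and the input values are hypothetical; in practice the weights would be fitted by back propagation, which is not shown.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer with sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(features, params):
    """Forward pass through a 7-6-6 network; returns six class scores."""
    hidden = layer(features, params["W1"], params["b1"])
    return layer(hidden, params["W2"], params["b2"])

rng = random.Random(0)
params = {  # random weights for illustration only
    "W1": [[rng.uniform(-0.5, 0.5) for _ in range(7)] for _ in range(6)],
    "b1": [0.0] * 6,
    "W2": [[rng.uniform(-0.5, 0.5) for _ in range(6)] for _ in range(6)],
    "b2": [0.0] * 6,
}
scores = forward([0.5] * 7, params)  # seven normalized attributes (hypothetical values)
predicted_class = scores.index(max(scores))
```

The seven inputs correspond to the selected physical attributes and the six outputs to the six varieties; the predicted variety is the output unit with the highest score.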

Statistical Analyses and Variety Classification

One-way analysis of variance was applied to the experimental data with the use of SPSS v20.0 (IBM SPSS® 2010) software, and significant means were compared with Tukey’s test (p<0.05). Canonical correlation was applied to examine the relationships among the investigated parameters. A scatter plot of the canonical discriminant functions presented the group centroids of the varieties. The principal components were assessed with multivariate tests (MANOVA). Hotelling’s pairwise comparisons were conducted with PAST v3.20 software to assess the differences among the sunflower oilseed varieties (Hammer et al. 2001).

Model Performance Evaluation

Model performance was assessed through statistical indicators including classification accuracy (Ac), sensitivity (Se), specificity (Sp), F-measure (F), precision (P), kappa statistic (κ), mean absolute error (MAE), and root mean square error (RMSE), which were calculated using Eqs. 4, 5, 6, 7, 8, 9, 10, and 11, respectively. The kappa statistic measures interrater reliability, here the agreement between the predicted and actual classes corrected for chance. A kappa of −1 indicates “total disagreement,” 0 indicates “random classification,” and 1 indicates “perfect agreement” (Viera and Garrett 2005). The following equations were used for the calculations (Parker 2001):

$$ {A}_c=\frac{TP+ TN}{TP+ FP+ TN+ FN}\times 100 $$
(4)
$$ {S}_e=\frac{TP}{TP+ FN}\times 100 $$
(5)
$$ {S}_p=\frac{TN}{TN+ FP}\times 100 $$
(6)
$$ F=\frac{2\times P\times {S}_e}{P+{S}_e} $$
(7)
$$ P=\frac{TP}{TP+ FP} $$
(8)
$$ \kappa =\frac{Prob\left(\alpha \right)- Prob(e)}{1- Prob(e)} $$
(9)
$$ MAE=\sum \limits_{i=1}^n\frac{\mid {E}_i-{M}_i\mid }{n} $$
(10)
$$ RMSE=\sqrt{\frac{\sum \limits_{i=1}^n{\left({E}_i-{M}_i\right)}^2}{n}} $$
(11)

where n = number of data instances, Mi = measured target value, Ei = predicted target value, Prob(α) = total agreement probability, Prob(e) = hypothetical probability of chance agreement, TP = number of true positives, FP = number of false positives, TN = number of true negatives, and FN = number of false negatives. The accuracy was assessed through the goodness of the prediction (Colton 1974). Correlation coefficients between 0 and 0.25 indicate little or no relationship, between 0.25 and 0.50 a fair degree of relationship, between 0.50 and 0.75 a moderate to good relationship, and between 0.75 and 1.0 a very good to excellent relationship.
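For a multi-class problem, the counts in Eqs. 5 to 8 are usually obtained per class by treating that class as positive and all others as negative, while kappa (Eq. 9) is computed from the whole confusion matrix. The sketch below follows that convention, which is an assumption about how the metrics were derived here; the 2 × 2 matrix is hypothetical.

```python
def per_class_counts(matrix, c):
    """TP, FP, FN, TN for class c, treating it as positive (one-vs-rest)."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[c][c]
    fn = sum(matrix[c]) - tp                 # actual c, predicted as another class
    fp = sum(row[c] for row in matrix) - tp  # another class, predicted as c
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

def kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows = actual, columns = predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    expected = sum(sum(matrix[i]) * sum(row[i] for row in matrix)
                   for i in range(size)) / total ** 2
    return (observed - expected) / (1 - expected)

matrix = [[8, 2], [1, 9]]  # hypothetical confusion matrix
tp, fp, fn, tn = per_class_counts(matrix, 0)
sensitivity = tp / (tp + fn) * 100  # Eq. 5
specificity = tn / (tn + fp) * 100  # Eq. 6
precision = tp / (tp + fp)          # Eq. 8
recall = tp / (tp + fn)             # sensitivity as a fraction
f_measure = 2 * precision * recall / (precision + recall)  # Eq. 7
```

In this example the observed agreement is 17/20 = 0.85 and the chance agreement is 0.5, giving κ = 0.7.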

Relative operating characteristic (ROC) curves can be used to visualize the tradeoff between the rates of correctly classified positive and negative samples. A ROC curve is a graphical plot of the true positive rate, also known as sensitivity, versus the false positive rate, or one minus the specificity, as the discrimination threshold is varied. Each point on the ROC curve represents a classifier with a particular tradeoff between sensitivity and specificity. To compare the output of multiple classification schemes with statistical instruments, the information expressed by the ROC curve must be collapsed into a single response variable (Pietersma et al. 2003; Stegmayer et al. 2013). The area under the entire ROC curve (AUC) was suggested as an acceptable performance index for this purpose (Bradley 1997) because it lies between 0 and 1 and makes it easier to compare classifiers with each other. A higher AUC value indicates a better predictive capability of a model. An AUC close to 1 means that most of the positive class samples have been given a higher score than any sample from the other classes, so that a threshold exists that separates the classes almost perfectly.
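The interpretation of AUC in terms of scores suggests a direct way to compute it: count how often a positive sample outscores a negative one. The sketch below uses this pairwise formulation, which is equivalent to the area under the empirical ROC curve; the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve as the fraction of positive-negative pairs in
    which the positive sample has the higher score (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores; label 1 marks the positive class.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
mixed = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

In the first case every positive sample outscores every negative one, so a separating threshold exists and the AUC equals 1. In the second case one of the four pairs is ordered wrongly and the AUC drops to 0.75.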

Results and Discussion

Six sunflower varieties (105 single seeds each) were evaluated in terms of fatty acid and mineral composition, biochemical characteristics, and physical attributes, and the experimental findings are provided in the following sections.

Biochemical Attributes

The proximate compositions of sunflower seeds are given in Table 2. The ash content of sunflower seeds ranged between 3.20 and 4.80%. The greatest ash content was obtained from the Transol variety and the least from the Tunca variety. Ash content of sunflower seeds was reported as 4.82–4.87% (Ingale and Shrivastava 2011). The present findings on ash content agree with the results of Gupta and Shrivastava (2004). Seed oil contents ranged from 39.9 to 48.1%, with the greatest value in the Tunca and the smallest value in the Colombi variety. Oil content of sunflower seeds was reported as between 38.13 and 50.20% (Ekin et al. 2005). Differences in oil contents were mostly attributed to environmental conditions, genotypes and cultural practices.

Table 2 Proximate composition of sunflower oilseed varieties

The fatty acid composition of sunflower seeds is given in Table 3. Significant differences were observed in crude oil content and fatty acid composition of the sunflower varieties. Among the 10 fatty acids identified in sunflower seeds, linoleic acid (C18:2 n6c) was the major fatty acid, followed by oleic acid (C18:1 n9c) and palmitic acid (C16:0), as the ranges reported below show. Saturated fatty acids affect the cholesterol level in humans; however, the effect depends on carbon-chain length and bioavailability (Webb and Sanders 1991; Cater and Garg 1997). Accordingly, behenic (C22:0) and arachidic (C20:0) acids are less responsible than myristic (C14:0), palmitic (C16:0) and stearic (C18:0) acids. In the present study, behenic (C22:0) and arachidic (C20:0) acids were the minor fatty acids in all sunflower varieties. The highest myristic acid (C14:0) content was determined in the P64LC53 and Tunca varieties (0.07%). The highest behenic (C22:0) and arachidic (C20:0) acid contents were observed in the 63MM54 variety. Similar findings were reported by Izquierdo and Aguirrezábal (2008). Palmitic and stearic acids ranged between 3.58–5.91% and 2.66–3.49%, respectively. In a previous study, similar palmitic acid (C16:0) and higher stearic acid (C18:0) contents were reported (Alberio et al. 2016). For dietary purposes, a higher content of mono-unsaturated fatty acids, especially oleic acid, is an important factor for human health (Esmaeili et al. 2012; Salimon and Farhan 2012). In recent years, many studies have also been carried out to increase the oleic acid content of seeds through genetic improvement. The present oleic acid (C18:1 n9c) contents varied between 22.3 and 26.5%, which is very low as compared to high oleic acid varieties. The present findings on oleic acid contents are consistent with the results of Li et al. (2017). Angeloni et al. (2017) reported that oleic acid contents varied with genotype and growing season. Werteker et al. (2010) reported highly variable oleic acid contents in sunflower seeds as compared to rapeseed and soybean seeds. The present linoleic acid (C18:2 n6c) contents ranged between 66.00 and 67.43%. Such findings agree with the results of Li et al. (2017). Linoleic acid is an essential fatty acid and plays an important role in various physiological functions of the human body, especially in the prevention of cardiovascular diseases (Connor 2000; Arts et al. 2001). The present linolenic acid (C18:3 n3) contents ranged between 0.04 and 0.10%. In a previous study, linolenic acid (C18:3 n3) contents were reported as between 0 and 3.9% (Werteker et al. 2010). Konuskan et al. (2017) indicated significant effects of variety, ecological and growing conditions and agronomic practices on the fatty acid composition of oilseeds.

Table 3 Fatty acid composition (%) of different sunflower oilseed varieties

The element composition of the sunflower seed samples is given in Table 4. As can be seen from Table 4, 13 different elements were detected in the seed samples, and the differences in micro element composition of the varieties were significant (p<0.05). Phosphorus (P), potassium (K), and sulfur (S) were the major elements with average concentrations of 6225.4, 5487.7, and 2454.2.6 mg/kg, respectively. The greatest K content was obtained from the P64LC53 variety (8507 mg/kg) and the lowest from the Transol variety (3675.7 mg/kg). The greatest magnesium content was detected in the P64LC53 and the least in the Colombi variety. In all sunflower varieties, the lowest element concentrations were observed for cadmium, which ranged between 0.04 and 1.63 mg/kg. After cadmium, trace quantities of Ni, Ba, Cu, Fe, Mn and Zn were observed (ranging between 6.01 and 50.87 mg/kg). Kirbaslar et al. (2012) reported the mineral composition of some seeds including sunflower and stated that potassium was the major element (3094.1 mg/kg) and that Cu, Zn and Fe were present in trace quantities (10.13, 28.30 and 38.65 mg/kg, respectively). Ingale and Shrivastava (2011) reported Ca contents between 1070 and 1500 mg/kg and P contents between 3900 and 4000 mg/kg for two sunflower genotypes. Among the present samples, 63MM54 and P64LC53 were prominent for mineral composition. These two genotypes showed significantly higher concentrations for the majority of the major and minor elements (Table 4).

Table 4 Mineral composition of different sunflower oilseed varieties (mg/kg)

Shape and Dimensional Attributes

The gravitational, dimensional and area attributes of the varieties are provided in Table 5. There were significant differences in the physical attributes of the sunflower varieties (p<0.01). Seed mass is an appropriate physical property for determining seed quality. The greatest seed weights were obtained from the Colombi and LG5582 varieties and the least from the 63MM54, P64LC53 and Tunca varieties. Similar to the present findings, Munder et al. (2017) reported seed weights of a hybrid sunflower variety (PR65H22) as between 0.024 and 0.108. Demir et al. (2018) indicated projected area as an important classification criterion. Projected areas of the present sunflower seed samples varied between 27.37 and 41.46 mm², with the greatest value from the Colombi and the least from the Tunca variety. Mirzabe et al. (2012) reported the projected area of the Sirena sunflower variety as between 27.38 and 69.82 mm². The greatest surface areas were obtained from the LG5582 and Colombi varieties (165.82 and 174.03 mm², respectively). However, Ortiz-Hernandez et al. (2020) reported greater surface areas for P64H41 precocious hybrid sunflower seeds than the present values. The greatest volume and geometric mean diameter values were determined in the Colombi variety as 201.50 mm³ and 7.26 mm, respectively (p<0.05). Malik and Saini (2016) reported volume and equivalent (geometric mean) diameter values of the PSH-996 variety at five different moisture contents (7.60–25.00% w.b.) as between 192.61 and 262.77 mm³ and between 7.15 and 7.93 mm, respectively, and those values were greater than the present ones. Gupta and Das (1997) reported the geometric mean diameter of sunflower seeds as 4.72 mm. The greatest lengths and widths were obtained from the Colombi and 63MM54 varieties. Santalla and Mascheroni (2003) reported seed length and width as 11.526 and 5.008 mm, respectively. Khodabakhshian et al. (2010) investigated the physical properties of the Shahroodi variety and reported thickness values between 3.88 and 4.94 mm.

Table 5 Gravitational, dimensional and area attributes for sunflower oilseed varieties

The dimension and shape attributes are important for the design of sunflower dehullers and separators. Sphericity, shape and aspect ratio of the sunflower varieties are provided in Table 6. None of the sunflower varieties was close to a sphere. The highest sphericity values were determined in the 63MM54 and P64LC53 varieties as 60.82 and 60.36%, respectively. Gupta and Das (1997) reported the average sphericity of the Morden variety as 57%. Increasing sphericity values were observed with decreasing shape index values. All varieties with a shape index greater than 1.25 were described as oval. Decreasing roundness and sphericity values were observed with increasing shape index values. Roundness values close to unity indicate an almost circular seed shape. The average roundness values of the sunflower seeds were close to each other and varied between 0.27 and 0.37. According to Table 6, the highest aspect ratio was observed in the 63MM54 variety and the greatest elongation values were found in the Tunca (2.10) and LG5582 (2.14) varieties. Jafari et al. (2011) reported the average elongation value of the Shamshiri variety as 3.26. Elongation and aspect ratio were negatively correlated. Increasing aspect ratios were observed with decreasing sphericity.

Table 6 Sphericity, shape and aspect ratio attributes for sunflower oilseed varieties

Discrimination of the Sunflower Varieties

Results of the canonical discriminant functions are provided in Table 7. The higher the eigenvalue, the more of the variation in the dependent variable the function explains. The effect size of a function depends on the square of its canonical correlation. The first three functions explained 93.3% of the total variation in the physical attributes of the varieties. Since the results of the biochemical attributes were insufficient in number, they were not handled by this method. Wilks’ lambda is generally used to identify the best prediction model and indicates the importance of each predictor variable. In this work, the canonical correlations for 5 results were significant (p<0.01). According to the Wilks’ lambda statistic, only 13% of the difference was left unexplained.

Table 7 Discriminant analysis results for the studied parameters of sunflower oilseed varieties

The standardized canonical discriminant function coefficients present the relative importance of 12 predictors for both physical and chemical attributes. In function 1, the highest loadings were observed for thickness and geometric mean diameter. In function 2, the greatest loadings were observed for thickness and aspect ratio. Additionally, among the biochemical attributes, palmitic and oleic acids were the strongest predictors for function 1, and arachidic and eicosenoic acids for function 2.

Figure 2 shows the centroids of the 6 sunflower oilseed varieties based on their canonical discriminant functions. With respect to the differences between principal components, thickness, geometric mean diameter and aspect ratio were considered important distinguishing features. Thickness was identified as the most important feature for the discrimination of varieties. The 63MM54, P64LC53 and LG5582 varieties, which are similar in many attributes, were located on the right of the canonical function 1 axis. Since Colombi and Transol are particularly similar in shape, they were located close to each other. Tunca, which has the lowest values for most attributes, was clearly distinguished from the other varieties and located at the bottom of the canonical function 2 axis. Based on the aspect ratio, the 63MM54 and P64LC53 varieties constituted a separate group.

Fig. 2 Scatter plots of the sunflower oilseed varieties in terms of the discriminant scores and group centroids for functions 1 and 2

Multivariate Tests (MANOVA) and Pairwise Comparison

The results of the MANOVA test, the Mahalanobis distances and the Bonferroni-corrected pairwise comparisons are given in Table 8. According to the Wilks’ lambda and Pillai trace statistics, the differences among the varieties were significant in terms of the physical attributes (p<0.01). The Pillai trace statistic, considered to be the most reliable among multivariate evaluations, considers the sum of the variance that explains the greatest discrimination of independent variables in dependent variables. Wilks’ lambda expresses the percentage of variance in the dependent variables in terms of differences in the independent variables. The smaller the Wilks’ lambda statistic, the greater the differences among the groups analyzed (Sayıncı et al. 2015). In most cases, varieties with a Mahalanobis distance lower than 3 showed statistically similar physical attributes (p>0.05). According to the Mahalanobis distances among the sunflower varieties, the closest distances were observed between the Colombi and Transol varieties and between the 63MM54 and P64LC53 varieties, while the distances between Tunca and all other varieties had the greatest values. Confusion matrix results are presented in Table 8. Discriminant analysis was able to classify the sunflower varieties with 77.94% accuracy. The greatest rates of correct classification were obtained for the Tunca and P64LC53 varieties, whereas Colombi and 63MM54 had the lowest rates among all varieties.

Table 8 Differences among the sunflower oilseed based on seed outlines for physical attributes

Classification Results

The classification matrix for size, shape, area and mass attributes and the performance comparisons of the 6 classification algorithms (DT, RF, SVM, MLR, NB and MLP) are given in Table 9. The confusion matrix of the classifiers is shown in Table 10. Classification accuracies ranged from 74.44 to 80.16%, and the RF algorithm had the greatest Ac and the lowest RMSE (0.218) as compared to the other algorithms. The DT, with a tree size of 55 and 28 leaves, was busy and complicated. Mass was identified as an important classification node in the structure of the tree. For this reason, if an automatic sorting system is developed based on the classifiers, the mass attribute can be used as a main criterion for product quality assessment (Banakar et al. 2017). The DT classification accuracy was 75.71%, which was the third lowest accuracy after NB (74.44%) and MLR (75.24%). Similar to the present findings, Zielinska et al. (2012) reported that the engineering properties-based classification accuracy of four different red clover seeds ranged from 41.0 to 78.5% for Naïve Bayes and from 52.8 to 85.4% for decision tree algorithms. Omid (2011) reported correct classification rates and RMSEs of 99.52% and 0.07 for the training set and 95.56% and 0.21 for the test set, respectively (Table 9).

Table 9 Performance results of machine learning classifiers
Table 10 Confusion matrix of classifiers for sunflower oilseed varieties (10-fold cross-validation)

Four main types of kernel function (polynomial, normalized polynomial, radial basis and Pearson VII function) are available for SVM, and the best one must be selected. In this study, the most appropriate kernel function for SVM was the Pearson VII function. The confusion matrix of SVM showed that 502 of the 630 sunflower seeds were classified correctly (Table 10), giving an accuracy of 79.68%. However, SVM had the highest MAE and RMSE values. Soltani and Omid (2015) reported an accuracy of 0.891 for SVM with a Pearson VII function-based kernel. Mollazade et al. (2012) obtained the greatest r and lowest RMSE from SVM with polynomial and Pearson VII function-based kernels.
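
For readers unfamiliar with the Pearson VII function-based kernel, the sketch below evaluates its commonly cited form, K(x, y) = 1 / [1 + (2·‖x − y‖·√(2^(1/ω) − 1) / σ)²]^ω. The parameter values σ = 1 and ω = 1 are assumptions for illustration only, since the values used in this study are not given here, and the feature vectors are hypothetical.

```python
from math import sqrt

def puk_kernel(x, y, sigma=1.0, omega=1.0):
    """Pearson VII function-based kernel between two feature vectors.

    sigma controls the width of the peak and omega its tailing; both
    defaults are illustrative assumptions, not values from the study.
    """
    dist = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    scaled = 2.0 * dist * sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega

# Hypothetical standardized attribute vectors for two seeds.
seed_a = [0.2, -0.1, 0.4]
seed_b = [0.1, 0.3, 0.0]
print(puk_kernel(seed_a, seed_a))  # identical vectors give a similarity of 1
print(puk_kernel(seed_a, seed_b, sigma=1.0, omega=1.0))
```

The kernel equals 1 for identical vectors and falls to 0.5 when the distance between the vectors is σ/2, so σ can be read as the full width of the peak at half its maximum.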

A multilayer perceptron (MLP) neural network was applied to the analysis of size, shape, and mass of the sunflower seeds. Seven physical attributes were used as inputs, with six neurons in the hidden layer and six outputs (i.e., a 7-6-6 structure). The accuracy, Kappa, sensitivity, specificity, precision and F-measure values of the MLP classifier were 78.89%, 0.75, 0.789, 0.958, 0.790, and 0.789, respectively. Kavdır and Guyer (2008) applied artificial neural network, MLP and DT classifiers to apples and reported the greatest classification accuracy for MLP (83.33%), followed by DT (75.56%). Gumus et al. (2018) reported that the accuracy of six olive oil classifications based on biochemical attributes ranged from 81.63 to 85.71%. The authors stated that the ANN results correlated well with the PCA results. Zielinska et al. (2012) used a multilayer perceptron and reported varietal classification accuracies between 57.6 and 89.6%. These findings support the present ones.
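
The 7-6-6 structure can be sketched as a single forward pass through a network with seven inputs, six hidden neurons and six output classes. The sigmoid hidden activation, the softmax output, the random weights and the input values are all assumptions made for illustration; the trained network of this study is not reproduced.

```python
import random
from math import exp

def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(inputs, weights, biases):
    """Weighted sum plus bias for every neuron of a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def sigmoid(values):
    return [1.0 / (1.0 + exp(-v)) for v in values]

def softmax(values):
    """Convert output scores to class probabilities that sum to 1."""
    m = max(values)
    exps = [exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, hidden, output):
    """7 inputs -> 6 hidden neurons (sigmoid) -> 6 class probabilities."""
    h = sigmoid(dense(features, *hidden))
    return softmax(dense(h, *output))

def count_parameters(*layers):
    """Number of weights and biases in the given layers."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

rng = random.Random(0)
hidden_layer = init_layer(7, 6, rng)   # 7 physical attributes -> 6 hidden neurons
output_layer = init_layer(6, 6, rng)   # 6 hidden neurons -> 6 varieties

# Hypothetical attributes of one seed, scaled to the 0-1 range.
seed = [0.62, 0.55, 0.48, 0.70, 0.33, 0.51, 0.44]
probs = forward(seed, hidden_layer, output_layer)
print([round(p, 3) for p in probs])
print(count_parameters(hidden_layer, output_layer))
```

Such a network has 7×6 + 6 = 48 hidden-layer parameters and 6×6 + 6 = 42 output-layer parameters, 90 in total, which is small relative to the 630 seeds available for training.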

The NB classifier had the lowest accuracy, kappa statistic, sensitivity, specificity and precision, at 74.44%, 0.69, 0.744, 0.949 and 0.745, respectively. Nevertheless, NB had the lowest MAE values. The present findings are in line with the results of Gumus et al. (2019), who reported accuracies of 79.59% for Naïve Bayes, 79.59% for decision tree, and 83.67% for support vector machine. In contrast, the accuracy and kappa statistic of Naïve Bayes reported by Siedliska et al. (2017) for fresh cherries (87.6% and 0.724) were greater than those of the present work (74.44% and 0.690), whereas the random forest accuracy (80.3%) was similar. Barbosa et al. (2015) also reported greater accuracy (95.4%) for NB and RF than the present study.
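
The performance measures quoted for each classifier can all be derived from its confusion matrix. The sketch below computes accuracy, Cohen's kappa and support-weighted averages of the per-class sensitivity, specificity, precision and F-measure. The weighting by class size is an assumption about how the averages were formed, and the confusion matrix is hypothetical rather than taken from Table 10.

```python
def safe_div(num, den):
    """Division that returns 0.0 instead of failing on an empty class."""
    return num / den if den else 0.0

def classification_metrics(cm):
    """cm[i][j] = number of class-i samples predicted as class j."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    support = [sum(row) for row in cm]                                # true counts
    predicted = [sum(cm[i][j] for i in range(k)) for j in range(k)]   # predicted counts
    accuracy = sum(cm[i][i] for i in range(k)) / n
    # Agreement expected by chance from the row and column totals.
    chance = sum(support[i] * predicted[i] for i in range(k)) / (n * n)
    kappa = safe_div(accuracy - chance, 1.0 - chance)
    totals = {"sensitivity": 0.0, "specificity": 0.0,
              "precision": 0.0, "f_measure": 0.0}
    for i in range(k):
        tp = cm[i][i]
        fn = support[i] - tp
        fp = predicted[i] - tp
        tn = n - tp - fn - fp
        scores = {
            "sensitivity": safe_div(tp, tp + fn),
            "specificity": safe_div(tn, tn + fp),
            "precision": safe_div(tp, tp + fp),
            "f_measure": safe_div(2 * tp, 2 * tp + fp + fn),
        }
        for name, value in scores.items():
            totals[name] += value * support[i] / n   # weight by class size
    return {"accuracy": accuracy, "kappa": kappa, **totals}

# Hypothetical confusion matrix for three varieties (rows: true, columns: predicted).
example = [
    [18, 1, 1],
    [3, 14, 3],
    [2, 4, 14],
]
for name, value in classification_metrics(example).items():
    print(f"{name}: {value:.3f}")
```

With support weighting, the averaged sensitivity is identical to the overall accuracy, which is consistent with the paired values reported here (for example 74.44% and 0.744 for NB).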

MLR is a simple model used for estimation in agricultural practices (Abdipour et al. 2019). In this study, the minimum value of F-measure was obtained for MLR, and the calculated values of Ac, Se, and Sp were 75.24%, 0.752, and 0.950, respectively. Among all classifiers, MLR was the best classifier for the LG 5582, 63MM54 and Tunca varieties. Similarly, Abdipour et al. (2019) used ANN and MLR and indicated that MLR was less capable of predicting seed yield in both training (R2=0.710, RMSE=0.064, and MAE=0.052) and testing (R2=0.686, RMSE=0.071, and MAE=0.054). The present findings are in line with the results of Niazian et al. (2018), who reported the ability of the MLR model to predict seed yield in the training (R2=0.81) and testing (R2=0.79) stages. The authors also indicated that the performance of ANN was better than that of MLR.

The accuracy of the predictive models is presented as ROC curves. As seen in Fig. 3, the receiver operating characteristic (ROC) curve, a graphical representation of the effectiveness of a predictive model, confirmed that the classifiers accurately identified the varieties. As expected, the highest ROC area values were obtained from RF, MLP, and SVM. In particular, the ROC area values indicate quite a high performance for automatic identification of the varieties under study, as the values obtained were higher than 0.84 in all cases.
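
The ROC area for one variety can be illustrated with its rank-based interpretation: the probability that a randomly chosen seed of that variety receives a higher class score than a randomly chosen seed of any other variety. The sketch below is a minimal one-vs-rest version with hypothetical labels and scores; it is not the routine used to produce Fig. 3.

```python
def roc_auc(is_positive, scores):
    """Area under the ROC curve as the fraction of positive/negative pairs
    ranked correctly, with ties counted as half."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(true_labels, prob_rows, classes):
    """ROC area of each class against all other classes."""
    return {
        c: roc_auc([t == c for t in true_labels], [row[i] for row in prob_rows])
        for i, c in enumerate(classes)
    }

# Hypothetical class probabilities for six seeds of three varieties.
classes = ["A", "B", "C"]
truth = ["A", "B", "C", "A", "B", "C"]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
    [0.4, 0.4, 0.2],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
]
print(one_vs_rest_auc(truth, probabilities, classes))
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values above 0.84 for every classifier indicate that the scores rank seeds of the correct variety well above the others.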

Fig. 3 ROC curves and area values for machine learning algorithms (a DT, b MLP, c MLR, d NB, e RF, f SVM)

Conclusion

The present study focused on the fatty acid and mineral profiles and physical attributes of six sunflower oilseed varieties, and detailed discrimination and classification were performed based on the physical attributes. In general, the greatest shape and dimension values were observed in the Colombi and LG 5582 varieties, while the highest mineral and fatty acid composition values were observed in the 63MM54 and P64LC53 varieties. In terms of physical attributes, excluding shape index and elongation, the Tunca variety had the lowest values. The first three functions explained 93.3% of the total variation in the physical attributes of the varieties. According to the Wilks’ lambda statistic, the unexplained part of the similarities or dissimilarities among the groups was 13%. The Colombi and Transol varieties, which had the lowest Mahalanobis distance, had the closest attributes. Before classification, feature selection was applied to the physical attributes, and projected area, length, sphericity, width, thickness, aspect ratio and mass were identified as the best features by the CFS-Best First method. Among the six classifiers (DT, RF, SVM, MLR, NB and MLP), RF and SVM (with the Pearson VII kernel function) yielded the better outcomes. MLP with a 7-6-6 topology also classified the sunflower varieties quite well.

Correct classification of sunflower oilseeds that meet specific requirements is important in the food and agricultural industries. By providing a correctly classified final product, the proposed algorithms could help food and seed companies (Kurtulmuş 2020). By using the algorithms and techniques suggested in this study, a separator or sorter could be designed, and the actual classification accuracy of the system could be determined during the process. Cleaning, dehulling, packaging, and oil expelling apparatuses are generally designed based on the physical features of the seeds. Such features are also used in the design of conveyor systems for product transportation. In addition, the specified features can be helpful in selection, breeding and quality assessment, and in the chemical and pharmaceutical industries (Çetin et al. 2020). This research showed that the classification of sunflower seeds using machine learning algorithms is feasible. The test results were encouraging in that the proposed approach could classify related sunflower oilseed varieties. These results should also encourage future studies on machine learning-based seed classification to use larger datasets and more features and algorithms.