Introduction

Soybean is regarded as a miracle crop due to its numerous beneficial properties and versatile uses in human food and animal diets (Ali et al. 2022; Hartman et al. 2011). Besides being a main source of edible oil, soybean is a protein-rich crop mainly used to feed livestock, poultry and aquaculture (Guo et al. 2022; Medic et al. 2014; Selle et al. 2020). It is one of the most widely cultivated oilseed crops, accounting for more than 60% of global oilseed production (Soy Stats 2022), and holds a premier position in terms of area and production among the oilseed crops (Rai et al. 2016). Globally, the area under soybean cultivation is about 122 million hectares, with a total annual production of 385.524 million tons and an estimated average grain yield of 2.8 tons ha−1 (USDA 2019). Brazil is the world's leading producer and exporter of soybean, accounting for 36% of global production, followed by the United States (28%), while Argentina, China and Paraguay contribute about 15%, 5% and 3% of global production, respectively (FAOSTAT 2023). The demand for soybean has increased manifold in recent years, owing to its extensive utility in human food and animal feed as well as multiple industrial applications (Dei 2011; Wilson 2004).

Natural genetic diversity is the basis for crop improvement as well as plant survival, and it should be exploited to meet the food security demands of a growing world population (Breseghello and Coelho 2013; Dong et al. 2014; Salgotra and Chauhan 2023). The presence of genetic variability helps plants adapt to a wide range of ecological conditions, whereas low variability leaves crops susceptible to various environmental stresses (Ali et al. 2023; Maxted et al. 1997). Exploration of genetic diversity in initial breeding material provides more opportunities to select potential lines for direct cultivation or for efficient utilization in cross-breeding programs (Govindaraj et al. 2015; Misganaw et al. 2023; Yadava et al. 2022). Thus, knowledge of genetic diversity assists plant breeders in broadening the genetic base of adaptable cultivars (Bhandari et al. 2017; Fu 2015; Tester and Langridge 2010; Žulj Mihaljević et al. 2020).

The ultimate goal of a soybean breeder is to develop new cultivars with improved seed yield and acceptable nutritional quality to meet the growing demand for edible oil and protein meal (Dornbos and Mullen 1992; Ghanbari et al. 2018; Sobko et al. 2020). Large-seeded soybean varieties that combine high protein content with a substantial oil concentration are in the greatest demand by the food and feed industries (Stobaugh et al. 2017; Xu et al. 2022). Therefore, novel genes or alleles for these targeted traits must be sought from diverse sources and recombined through conventional breeding procedures to maximize soybean productivity along with nutritional quality (Mello-Filho et al. 2004; Sharma et al. 2014; Zhao et al. 2021). Nevertheless, selecting promising parents with multiple features is a monumental task in soybean breeding because of the genetic bottleneck in modern cultivars and the unknown pedigrees of available genetic resources (Hyten et al. 2006; Kumar et al. 2022; Grainger and Rajcan 2014). Thus, developing improved varieties with desired attributes, including high seed yield and enhanced nutritional quality, is the central focus of soybean breeders (Ali et al. 2022; Sudarić et al. 2019).

Despite being an agricultural country, Pakistan faces a severe shortage of edible oils and depends overwhelmingly on imports to meet domestic needs (Asad et al. 2020; Tariq et al. 2022). The total domestic production of edible oil from all sources is about 0.496 million tons, which meets only 12% of the national requirement. The remaining 3.177 million tons (88%) are acquired from foreign sources at an annual cost of $3.562 billion, resulting in a massive trade deficit and a huge burden on the national exchequer (Government of Pakistan 2022–2023). Moreover, Pakistan is unable to produce enough soybeans, mainly due to the lack of improved quality varieties with desired characters (Asad et al. 2020; Nasir et al. 2023). Currently, very few varieties are available for general cultivation, and most of them are deficient in desirable attributes (Iqbal et al. 2008). Even though the climatic conditions and soil composition of the country are ideal for soybean cultivation, very little effort has been made to improve the genetic make-up of this valuable crop (Iqbal et al. 2010; Malik et al. 2007). As a result, soybean imports have increased manifold in recent years, reaching almost 2.5 million metric tons (USDA 2019), owing to urban sprawl and the steadily expanding poultry industry of the country (Habib et al. 2016).

Agro-morphological characterization is a conventional breeding procedure used for determining yield potential, for genomic selection and for maintaining the genetic purity of cultivated varieties (Farahani et al. 2019; Li et al. 2020; Malek et al. 2014). The phenotyping of germplasm provides valuable information for sustainable gene-bank conservation, dynamic management and optimal utilization of elite genetic resources in initial breeding programs (Dong et al. 2004; Gautam et al. 2004; Govindaraj et al. 2015). Traditionally, genetic variability among local and exotic genetic pools of soybean has been evaluated on the basis of pheno-morphic and agronomic traits in order to select elite breeding lines for improving soybean productivity and nutritional composition (Assefa et al. 2019; Sharma et al. 2014; Kumar et al. 2015; Ghanbari et al. 2018). Thus, the current study aims to provide comprehensive information on the agro-morphological characterization of available soybean germplasm and on a selection strategy that might be utilized in future breeding programs to develop new varieties with high yield potential and improved nutritional quality in order to ensure national food security.

Materials and methods

Experimental site description, plant materials and layout of design

The field experiment was conducted at the Knoot Research Farm of PMAS-Arid Agriculture University Rawalpindi, Pakistan, during two Kharif seasons (2019 and 2020). The topsoil (0–15 cm) of the research site was classified as sandy clay loam with 54.2% sand, 23.4% silt and 22.4% clay. The research station is located at 33°06′ N latitude and 73°01′ E longitude, at an altitude of 474 feet above sea level. A mini core collection comprising fifty-nine soybean genotypes (56 test entries and three checks) originating from different countries was evaluated for qualitative and quantitative traits. Information on the genetic material, including passport code, variety name, geographic origin and collection source, is presented in Table S1.

The germplasm was grown under natural field conditions for two cropping seasons using an augmented block design (Federer 1956). Sowing was done at the start of the rainy season, on 14th July in the first year and 10th July in the second year, depending on soil moisture availability. The meteorological data of the experimental site during the two cropping cycles are summarized in Table S2. The genotypes were assigned to eight blocks, each containing seven test entries and three standard checks. Each genotype was planted in a single row of 5-m length, with row-to-row and plant-to-plant spacings of 45 cm and 10 cm, respectively. The test entries were planted in a single replicate, whereas the commercial checks were repeated once in each block at random positions. The seeds were sown manually at a depth of 3–4 cm, and thinning was done after germination to maintain a density of 20 plants per linear meter. To ensure proper crop growth, recommended cultural practices were implemented uniformly for each block. The agro-morphological measurements were recorded according to the standard descriptors for soybean (IBPGR 1984). However, seed quality parameters such as protein and oil content were estimated separately for all tested entries and replicated checks following standard procedures (AOAC 1995; Latimer 2016).

Data analysis

The numerically coded data derived from each qualitative trait were subjected to cluster analysis using Ward’s method (Ward 1963). The Shannon–Weaver diversity index (H′) for all qualitative traits was calculated as per the following equation (Shannon 1948).

$$\text{Shannon Index }(H^{\prime})=-\sum_{i=1}^{n} p_{i}\,\log p_{i}$$

where p_i = proportion of genotypes falling in the i-th phenotypic class of a trait; n = total number of classes for the trait; i = class index (1 to n); log = natural logarithm, which reproduces the H′ values reported in Table 1.
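As a worked check of the index, the short sketch below computes H′ for flower colour from the class counts reported in this study (52 white-flowered and 7 purple-flowered genotypes). The function name `shannon_index` is illustrative and is not taken from the original analysis; the natural logarithm is assumed.

```python
import math


def shannon_index(class_counts):
    """Shannon diversity index H' = -sum(p_i * ln(p_i)) over phenotypic classes."""
    total = sum(class_counts)
    # Empty classes contribute nothing, so skip them to avoid log(0).
    proportions = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Flower colour: 52 white and 7 purple genotypes out of 59.
flower_colour_h = shannon_index([52, 7])
print(round(flower_colour_h, 3))  # 0.364
```

The result agrees with the H′ of 0.364 reported for flower colour, which supports reading "log" as the natural logarithm.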

The mean data of twenty quantitative agro-morphological traits were subjected to analysis of variance (ANOVA) for the augmented block design using R statistical software (ver. 3.6.1). Descriptive statistics, namely the arithmetic mean, range (minimum and maximum), variance and coefficient of variation, were computed using IBM SPSS version 23 (Kirkpatrick 2015). Two-year mean data (Tables S3–S22) were adjusted by ignoring treatment and block effects and used for multivariate statistical analyses. Non-hierarchical K-means clustering based on Euclidean distances was performed in XLSTAT to discriminate trait-specific groups. The correlation matrices were calculated at a significance level of α = 0.05 (Kwon and Torrie 1964), whereas principal component analysis (PCA) was performed using RStudio.

Results and discussion

Phenotypic characterization of qualitative traits

Seed morphometry is a vital feature for genomic classification and for maintaining the seed quality of soybean. Consumer preference and global market demand are greatly influenced by soybean seed quality and purity (Sudarić et al. 2019). In the present study, the frequency distribution of eight qualitative traits was visually assessed for all soybean genotypes (Table 1). The tested germplasm exhibited remarkable variation for flower colour and pod pubescence, while stem growth, pod colour, seed coat colour, hilum colour, seed shape and seed lustre displayed relatively moderate variation across the genotypes. A wide range of variation for qualitative traits assists soybean breeders in selecting promising lines with distinct features (Kachare et al. 2020; Shrestha et al. 2023).

Table 1 Frequency distribution, percentage and Shannon Diversity Index (SDI) for eight qualitative traits

Among the studied soybean population, the majority of genotypes exhibited a determinate stem growth habit. White flower colour was more common in the population than purple flower colour. Three different pod colours were recognized. Pubescence (hairy structures) on the upper surface of pods was present in 39 genotypes and absent in 20 genotypes (Table 1). The evaluated genotypes showed discernible variation for other seed traits, such as seed coat colour, hilum colour, seed shape and seed lustre (Fig. 1). Furthermore, the Shannon diversity index (H′) was highest for hilum colour (1.704), followed by seed shape (1.268), while it was lowest for flower colour (0.364) because 88% of the entire population had white flowers. Several previous studies have highlighted the importance of qualitative traits in germplasm characterization and varietal identification of soybean (Tripathi and Khare 2016; Khanande et al. 2016; Shrestha et al. 2023; Arteaga et al. 2019).

Fig. 1 The phenotypic variation in seed appearance and seed shape of fifty-nine soybean genotypes

Cluster analysis based on qualitative traits

The cluster analysis distributed the soybean genotypes into three main groups and six distinct clusters. The highest number of genotypes (13) was retained in cluster-III, followed by cluster-IV and cluster-V with 12 genotypes each. Cluster-II retained the fewest genotypes (6), followed by cluster-VI and cluster-I with 7 and 9 genotypes, respectively (Fig. 2). The clustering pattern was consistent with flower colour: the first five clusters (I–V), comprising 52 genotypes, exhibited white flowers, while cluster-VI comprised only seven genotypes with purple flowers. This indicates that flower colour could be an important marker trait in soybean breeding for identifying desired cross combinations.

Fig. 2 Dendrogram showing the divergence and relatedness of fifty-nine soybean genotypes for eight qualitative traits

Analysis of variance for augmented block design

The augmented block design is an ideal model which is primarily used to evaluate large sets of test entries in pre-breeding programs in order to identify new selections (Federer 1963; Kempton and Gleeson 1997). It is a cost-efficient design and is feasible in situations where the experimental seed is too limited in quantity for replication, or where the plant breeder is unable to control the differences between experimental units (Federer 1956). The information obtained from standard checks can be used to adjust the means of tested entries and to provide an appropriate error mean square for comparisons among the various sources (Federer et al. 2001; Saba et al. 2017). In this study, the range of variability was measured for 20 agro-morphological, seed yield and quality traits of soybean, and significant differences were observed across genotypes and other sources of variation for all studied traits except number of nodes plant−1 (Table 2). This indicates that the evaluated germplasm panel exhibited a high level of variation, which favoured the selection process (Li et al. 2020; Ullah et al. 2021). A high degree of genetic variability plays a vital role in selective breeding and in exploring the genetic potential of soybean materials for varietal development (Anderson et al. 2019; Bailey-Serres et al. 2019). Several earlier studies have also reported significant variation for various agronomic traits of soybean (Iqbal et al. 2010; Khurshid et al. 2020; Ullah et al. 2021). Thus, the substantial variability observed among the tested soybean germplasm could be exploited to maximize the genetic gain of target traits.
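The way replicated checks are used to adjust unreplicated test entries can be sketched as follows. This is a simplified illustration and not the R routine used in this study: it assumes the same checks occur once in every block, estimates each block effect as the deviation of that block's check mean from the overall check mean, and subtracts it from the test entries in the block. The function name, the two-block layout and all yield values are hypothetical.

```python
from statistics import mean


def adjust_test_entries(checks_by_block, tests_by_block):
    """Remove estimated block effects from unreplicated test entries.

    checks_by_block: {block: [check values]}, same checks in every block.
    tests_by_block:  {block: {entry name: observed value}}.
    """
    grand_check_mean = mean(v for vals in checks_by_block.values() for v in vals)
    adjusted = {}
    for block, entries in tests_by_block.items():
        # Block effect = how far this block's checks sit from the overall check mean.
        block_effect = mean(checks_by_block[block]) - grand_check_mean
        for name, observed in entries.items():
            adjusted[name] = observed - block_effect
    return adjusted


# Hypothetical seed yields (g per plant) for two blocks with three checks each.
checks = {"B1": [10.0, 12.0, 14.0], "B2": [12.0, 14.0, 16.0]}
tests = {"B1": {"G1": 9.0}, "B2": {"G2": 15.0}}
adjusted_yields = adjust_test_entries(checks, tests)
print(adjusted_yields)  # {'G1': 10.0, 'G2': 14.0}
```

In this toy case block B1 is one unit poorer and block B2 one unit richer than average, so G1 is adjusted upward and G2 downward before the entries are compared.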

Table 2 Mean square of 20 quantitative traits from analysis of variance during Kharif 2019 and 2020

Descriptive statistics for quantitative traits

The basic summary statistics (mean, range, variance and coefficient of variation) were computed for each studied trait (Table 3). Remarkable variation was observed in the range of all studied traits except number of branches, pod length, pod width, seeds per pod, seed dimension-related features and stem diameter. The presence of sufficient variability indicates that the evaluated soybean materials were highly diverse and could be utilized in various breeding programs to increase soybean grain yield and nutritional quality (Aditya et al. 2011; Sharma et al. 2014). Importantly, high variances were noticed for DFF, DFC, DM, PH, NFP and DMWP, which indicates that these traits are of greater importance in soybean selection and genetic improvement (Reni and Rao 2013). However, moderate to low variances were observed for SYP, NUFP, PC, NNP, HSW, NBP, OC, SD, ST, SW, SL, PW, PL and NSP, in that order, which indicates that selection and improvement of these traits might be limited due to narrow genetic variability. Ghafoor et al. (2001) suggested that conventional breeding procedures would be helpful in improving characters with a low magnitude of variance. Moreover, the coefficient of variation (CV) is a crucial indicator for measuring the precision and accuracy of field experiments. During both years, the highest CV (%) was observed for number of unfilled pods (NUFP), while the lowest CV (%) was recorded for pod width (Table 3). According to Poehlman (2013), a coefficient of variation of less than 20% is considered reliable for field experiments.
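For reference, the descriptive form of the coefficient of variation is the standard deviation expressed as a percentage of the mean. The sketch below assumes the sample standard deviation and uses hypothetical pod-width readings; it does not reproduce any value from Table 3.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV (%) = sample standard deviation / mean * 100."""
    return stdev(values) / mean(values) * 100


# Hypothetical pod-width readings (mm) for a handful of genotypes.
pod_width = [8.0, 9.0, 10.0, 11.0, 12.0]
cv_percent = coefficient_of_variation(pod_width)
print(round(cv_percent, 2))  # 15.81
```

A value such as this one falls under the 20% threshold cited above.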

Table 3 Descriptive statistics for 20 agro-morphological, yield and quality traits during Kharif 2019 and 2020

Non-hierarchical cluster analysis

A non-hierarchical clustering technique was used to categorize the soybean genotypes into five distinct clusters (I–V) based on Euclidean distances. The first cluster retained the highest number of genotypes (18), accounting for 30.50% of the collection (Table 4). Of the remaining genotypes, 12 were grouped in cluster II, 15 in cluster III, 8 in cluster IV and 6 in cluster V.
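The grouping step can be illustrated with a minimal K-means sketch (Lloyd's algorithm with Euclidean distance). It is not the XLSTAT procedure used here, whose initialization and stopping options are not described in this study; the starting centroids are passed in explicitly so that the example is deterministic, and the two-trait data points are hypothetical.

```python
import math


def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's K-means with Euclidean distance; returns (labels, centroids)."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each genotype joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the solution is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical genotypes described by two traits (arbitrary units).
genotypes = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(genotypes, init_centroids=[genotypes[0], genotypes[-1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

With real data, traits measured on different scales would normally be standardized first so that no single trait dominates the Euclidean distance.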

Table 4 Five clusters of soybean genotypes based on agro-morphological, yield and quality traits

The genotypes belonging to cluster V exhibited the highest mean values for days to flowering (57.42 ± 1.86), days to flower completion (64.25 ± 1.78), maturity duration (117.50 ± 7.40), number of nodes (13.78 ± 0.81), branches per plant (6.70 ± 2.43), unfilled pods (9.60 ± 4.47) and stem diameter (6.78 ± 0.88). This indicates that late-maturing genotypes with a higher number of empty pods were gathered in cluster V. The genotypes with the greatest plant height (75.11 ± 13.01) and the most filled pods (69.83 ± 9.18) were retained in cluster II (Table 5). Likewise, the genotypes assembled in cluster IV were superior for the majority of seed yield-related traits, such as pod length (3.89 ± 0.40), pod width (8.84 ± 0.53), number of seeds per pod (2.42 ± 0.24), seed length (7.23 ± 0.72), seed width (6.39 ± 0.74), seed thickness (5.27 ± 0.69), hundred seed weight (11.95 ± 2.69), dry matter (49.53 ± 7.89) and seed yield per plant (15.23 ± 4.26). For seed quality traits, relatively minor differences were observed between the clusters. Non-hierarchical cluster analysis classifies germplasm using multiple variables simultaneously and provides distinct classes based on dissimilarity matrices (Anuradha et al. 2011). Our results are consistent with earlier findings in which soybean germplasm was collected from different countries and classified on the basis of phenotypic variability (Getnet 2018; Malik et al. 2011). Thus, selecting potential genotypes from clusters II and IV would be an efficient breeding strategy for improving soybean seed yield together with nutritional composition through conventional breeding procedures.

Table 5 Grand means and standard deviation of 20 agro-morphological, yield and quality traits

Correlation analysis of quantitative traits

Correlation analysis is a powerful statistical tool for determining the interrelationships between multiple traits simultaneously. Results revealed that, out of all 190 possible pairwise combinations, 106 pairs were statistically significant at a probability level of ≥ 95%, with a correlation coefficient of not less than 0.26 in absolute value (Fig. 3). The phenological traits DFF, DFC and DM showed a positive correlation with SD and DMWP, but a negative association was observed with seed yield-related traits, i.e., PL, PW, SL, SW, ST and HSW. This indicates that late flowering and delayed maturity would adversely affect pod features and seed-related traits under certain edapho-climatic conditions (Jiang et al. 2014). According to Arshad and Ghafoor (2006), traditional breeding approaches might be productive in overcoming these undesirable linkages for developing short-duration cultivars with improved grain yield in soybean. The correlations among DFF, DFC, DM, NNP, NBP and NUFP were significant and positive, which indicates that late-flowering genotypes also matured late and produced a higher number of branches and empty pods. Thus, selection based on these traits may not be a sound breeding strategy for improving soybean productivity.
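The figures quoted above can be checked with a short calculation. The sketch assumes that each correlation is based on all 59 genotypes (so df = 57) and takes the two-sided critical t value for α = 0.05 as approximately 2.0025 from standard t-tables; both are assumptions, since neither is stated explicitly in this study.

```python
import math

n_traits = 20
n_genotypes = 59      # assumed sample size behind each correlation
t_critical = 2.0025   # approximate two-sided t quantile, alpha = 0.05, df = 57

# Number of distinct trait pairs: C(20, 2).
n_pairs = math.comb(n_traits, 2)

# Smallest |r| that reaches significance: r = t / sqrt(t^2 + df), with df = n - 2.
df = n_genotypes - 2
r_critical = t_critical / math.sqrt(t_critical**2 + df)

print(n_pairs)               # 190
print(round(r_critical, 2))  # 0.26
```

Under these assumptions the threshold is about 0.256, which is consistent with the reported minimum significant coefficient of 0.26.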

Fig. 3 Pairwise correlation patterns between 20 agro-morphological, seed yield and quality related traits

Seed yield of soybean is a complex trait which is significantly influenced by environmental changes and is primarily determined by other contributing traits (Malek et al. 2014). In this study, seed yield (SYP) showed a positive association with PH, NNP, NFP, PL, PW, NSP, SL, SW, ST, HSW, SD and DMWP, but only a weak association with the seed quality traits, protein and oil content. This indicates that simultaneous selection of elite genotypes with higher protein concentration and oil content together with improved seed yield is a major challenge in soybean improvement (Cober and Voldeng 2000; Vollmann et al. 2000). The major contributing traits such as number of filled pods, seed vigour and hundred seed weight should be prioritized in the genomic selection of soybean (Liu et al. 2019; Malek et al. 2014; Malik et al. 2011). Thus, selecting promising genotypes with a set of traits positively correlated with seed yield may be favoured in indirect selection (Leite et al. 2018; Ghanbari et al. 2018).

The correlation between the seed quality traits, protein content (%) and oil content (%), was found to be strongly negative (Fig. 3). This indicates that an increase in oil content would significantly reduce the protein content of soybean seed. Although soybean is a good source of edible oil and harbours a significant amount of seed protein, the negative association between these two traits poses a real challenge for soybean breeders (Kwanyuen et al. 1997; Qin et al. 2014). This inverse relationship might be due to the distribution of carbon chains that synthesize protein and oil content in soybean (Hernández-Sebastià et al. 2005; Guo et al. 2022). It is noteworthy that both seed quality traits are quantitatively inherited and determined by the interaction of multiple genes (Hwang et al. 2014). Several previous studies have also confirmed the strong negative correlation between protein content and oil content in soybean seed composition (Mello-Filho et al. 2004; Popovic et al. 2012). Thus, knowledge of character associations in soybean would assist breeders in formulating long-term breeding strategies (Ghanbari et al. 2018; Machado et al. 2017).

Principal component analysis (PCA)

Principal component analysis was performed to estimate the phenotypic diversity among soybean genotypes. The scree plot indicated that, out of twenty principal components, only the first five PCs were considered significant, as they had eigenvalues greater than one (Fig. 4). These five component axes, PC1, PC2, PC3, PC4 and PC5, with eigenvalues of 7.42, 4.73, 1.81, 1.27 and 1.01, respectively, explained 81.22% of the total variability across the tested soybean genotypes. The rest of the variance (about 19%) was cumulatively described by the remaining 15 PCs with eigenvalues below one, which were not considered for further interpretation.
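The link between the reported eigenvalues and the percentage of variance explained can be verified with a few lines of arithmetic. The sketch assumes that the PCA was run on standardized traits (a correlation matrix), in which case the eigenvalues of all twenty components sum to 20; this assumption is not stated in the study but is consistent with the eigenvalue-greater-than-one rule applied above.

```python
eigenvalues = [7.42, 4.73, 1.81, 1.27, 1.01]  # first five PCs reported in the text
n_traits = 20  # with standardized traits, the eigenvalues of all PCs sum to 20

# Eigenvalue-greater-than-one rule: retain components whose eigenvalue exceeds one.
retained = [ev for ev in eigenvalues if ev > 1]

# Share of total variance carried by each retained component, in percent.
explained = [ev / n_traits * 100 for ev in retained]
cumulative = sum(explained)

print(len(retained))         # 5
print(round(cumulative, 1))  # 81.2
```

The cumulative share agrees with the reported 81.22%, and the first two entries of `explained` come out close to the 37.11% and 23.67% quoted for PC1 and PC2; the small differences reflect rounding of the published eigenvalues.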

Fig. 4 Scree plot representing the eigenvalues and cumulative variances (%) for twenty principal components

PCA variable plot

The PCA variable plot revealed that the first two principal components, PC1 and PC2, explained more than 60% of the variation, contributing 37.11% and 23.67%, respectively, to the total phenotypic variability (Fig. 5a). The quantitative traits with positive loadings on PC1 were, in descending order, SW (0.92), HSW (0.88), PL (0.87), PW (0.87), ST (0.86), SL (0.85), SYP (0.82), DMWP (0.81) and NSP (0.69) (Table 6). On the other hand, traits with short to medium loading vectors, namely DFF (−0.441), DFC (−0.421), DM (−0.412), NBP (−0.513) and NUFP (−0.443), loaded negatively on the first principal component (Table 6). Likewise, PC2 was mainly related to PC (0.73), DFC (0.35), DM (0.32) and HSW (0.31), whereas the traits with negative loadings on the second principal component were OC, PH, NBP, NFP, NUFP, DMWP and SYP (Fig. 5a). PC3 was described by PC and OC with positive loading weights, while other traits like PL, PW, ST and HSW exhibited very low variances with minor effects. In the case of PC4 and PC5, none of the traits contributed significantly to the variation.

Fig. 5 a Graphical ordination of the loading plot displaying the contribution of 20 quantitative traits. b Score plot for PC1 and PC2 showing the divergence of soybean genotypes

Table 6 The first five principal components determined the greater variation for studied quantitative traits

PCA score plot

A two-dimensional PCA score plot was drawn between PC1 and PC2 to identify diverse genotypes. The genotypes dispersed at extreme positions away from the origin demonstrated high genetic variability (Fig. 5b). Certain genotypes, namely Jhunghwang, K-D, 24,598, G-35, Brazil-3, 24,562, 24,592, 24,578, 24,560, Aga1, Ajmeri-1, NARC-2 and 24,608, were found to be the most diverse among the studied materials. It is worth mentioning that the genotypes located on the right side of the plot exhibited high variances and may be given preference in the selection process (Bartual et al. 1985). Thus, genotypes including G-35, K-D, Jhunghwang and 24,598, with higher variances for one or more traits, were identified as valuable genetic sources for introgressing desired genes into adaptable local varieties. Several earlier studies have utilized principal component analysis for determining genetic variability and identifying diverse genotypes in soybean (Iqbal et al. 2008; Mannan et al. 2010).

Conclusion

This study described a suitable strategy for the identification and selection of superior progenitors with a broad genetic spectrum based on phenotypic variability. The evaluated soybean genetic materials displayed considerable divergence for agronomic, yield and seed quality traits. Various agro-morphological characters were recognized as key indices based on a multivariate approach, which might be effective in future soybean breeding programs to improve seed yield and nutritional quality simultaneously. Overall, three exotic lines, viz., K-D, Jhunghwang and 24,598, were identified for improved productivity and may be approved for general cultivation or conserved as elite resources for future breeding programs. Additionally, two local varieties, NARC-2 and Ajmeri-1, performed better for oil composition and hence could be utilized as potential parents for increasing oil content in other commercial cultivars. Taken together, the prominent genotypes identified during this study would serve as a benchmark for developing high-yielding and nutritionally enriched soybean cultivars for local ecologies.