Introduction

Cassava (Manihot esculenta Crantz) stands as one of the most important staple crops, providing sustenance and livelihood to millions of people globally, especially in tropical regions (Ceballos et al. 2020). This crop serves as the primary source of calories and income for small-scale farmers with its starchy roots providing carbohydrates and leaves offering vitamins, proteins, and minerals (Bayata 2019). Cassava’s unique ability to thrive in marginal ecologies with low soil fertility and rainfall makes it a crucial player in global agriculture, food security, and economic growth in such ecologies (Ngongo et al. 2022).

Advancements in breeding programs heavily rely on gaining a deeper understanding of the genetic diversity within populations which functions as a repository of diverse genes with significant potential for enhancing productivity and adapting to both abiotic and biotic stress (Adu et al. 2021). The relevance is particularly pronounced in the context of ongoing climate change and global warming as genetic variation serves as a foundation upon which breeding efforts are built, providing the raw material for developing improved varieties. Variations in traits such as yield, disease resistance, drought tolerance, and nutritional content are essential for enhancing cassava resilience and nutritional value, making them fundamental prerequisites for sustainable cassava cultivation, crop improvement, and conservation. The analysis of genetic diversity and population structure plays a crucial role in understanding the historical patterns of natural selection and the genetic connections among different accessions (Luo et al. 2019).

In cassava research, genetic diversity has been extensively explored using morphological and molecular markers across various germplasm accessions and traits. Morphological markers were used to study genetic diversity and relationships among elite cassava cultivars in Ghana (Sivan et al. 2023) and Benin (Agre et al. 2016). However, the use of morphological markers for breeding purposes has been limited by their vulnerability to environmental influence and their low polymorphism. In contrast, molecular markers are efficient and precise tools for estimating genetic diversity and ascertaining population structure within plant populations (Pour-Aboughadareh et al. 2022).

Several molecular markers, such as SSR (Adjebeng-Danquah et al. 2020), DArT (Adu et al. 2021), and ISSR (Sandra et al. 2019), have been utilized to assess genetic variation in cassava accessions. SNP markers have also been employed to evaluate the diversity of whiteflies (Wosula et al. 2017), cassava mosaic begomovirus (Aimone et al. 2021), and genetic information within and between populations. In recent studies, SNP and SilicoDArT markers were used to identify 47% of distinct cassava genotypes represented by 87 accessions from the CIAT cassava genebank (Carvajal-Yepes et al. 2023). A study aimed at determining diversity within local landraces in Burundi identified six clusters and pairs of duplicates (Pierre et al. 2022). Another study reported high diversity for CIAT germplasm, while low diversity was reported for IITA and East, South, and Central Africa (Ferguson et al. 2019a). SNP markers were chosen in the present study to assess genetic diversity within and between populations because of their abundance, stability, polymorphism, and compatibility with automation.

This study addresses the long-standing challenges that Cassava Mosaic Disease (CMD) and Cassava Brown Streak Disease (CBSD) pose to cassava production by using resistant cassava populations to elucidate their genetic characteristics, population structure, and gene flow patterns. The objectives of this study are therefore (1) to unravel allelic diversity and heterozygosity levels, (2) to explore population structure, (3) to detect potential admixture, and (4) to identify gene flow patterns. The ultimate goal is to contribute valuable information to cassava breeding programs in Uganda for the development of more resilient, disease-resistant varieties.

Materials and methods

Plant material and population development

Two groups comprising a total of 155 cassava genotypes were used in the study. The first group consisted of 80 biparental genotypes derived from the parents MM060128 and TME14, collected from existing germplasm at the National Semi-Arid Resources Research Institute (NaSARRI) in Eastern Uganda. The second group included 75 biparental genotypes resulting from crosses between the parents COL40 and TME14, obtained from the International Institute of Tropical Agriculture (IITA) in Sendusu, Central Uganda. The parent COL40 originates from South America, TME14 from West Africa, and MM060128 from East Africa.

Genotyping

A total of 155 cassava leaf samples were collected, one from a single representative plant per accession. The representative plant was selected at random; its young top leaves were folded, and four leaf disks were punched with a 5-mm puncher and placed in a 96-well plate. The samples were oven dried and shipped in silica gel to the Intertek laboratory in Australia for DNA extraction. Genotyping-by-sequencing and SNP calling were done for each sample using the Diversity Arrays Technology sequencing (DArTseq™) genotyping platform at Bruce, ACT, Australia (https://www.diversityarrays.com/technology-and-resources/dartreseq). This technology involves library construction through the DArTseq complexity reduction method, which entails digestion of the genomic DNA using restriction enzymes followed by ligation of barcoded adaptors. The adaptor-ligated fragments were then amplified using polymerase chain reaction (PCR). The resultant PCR products of each sample were sequenced on a HiSeq 2500 (Illumina Inc., San Diego, CA, USA). Identical sequences were collapsed into FASTQCOL files, from which the software package DArTsoft14 was used for marker discovery and scoring. The single nucleotide polymorphism (SNP) markers were scored and converted to HapMap format after mapping them to the cassava (Manihot esculenta) reference genome v8.1 available in Phytozome (Goodstein et al. 2012).

Genotype data processing

The data in HapMap format was converted to variant call format (VCF) using TASSEL (Bradbury et al. 2007). The genotype data was filtered by removing SNPs with less than 80% call rate and less than 5% minor allele frequency (MAF) using VCFtools (Danecek et al. 2011). The filtered markers were used for subsequent analysis. The marker characteristics such as polymorphic information content (PIC), reproducibility, and call rate were determined in the dartR package of R (Gruber et al. 2018).
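The two filtering thresholds described above can be illustrated with a minimal sketch. It is not the VCFtools workflow used in the study; it assumes genotypes are already coded as counts of the alternate allele (0, 1, 2) with `None` for missing calls, and the function names and toy data are hypothetical.

```python
from typing import List, Optional

# Assumed coding: 0, 1, 2 = copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype call."""
    return sum(c is not None for c in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Minor allele frequency computed from the non-missing diploid calls."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def filter_snps(matrix: List[List[Genotype]],
                min_call_rate: float = 0.80,
                min_maf: float = 0.05) -> List[int]:
    """Return the row indices of SNPs that pass both thresholds."""
    return [
        i for i, calls in enumerate(matrix)
        if call_rate(calls) >= min_call_rate
        and minor_allele_frequency(calls) >= min_maf
    ]


# Toy data (hypothetical): rows are SNPs, columns are samples.
toy = [
    [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],           # passes both filters
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],           # monomorphic, MAF = 0
    [0, 1, None, None, None, 1, 2, 0, 1, 1],  # call rate = 0.7
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],           # MAF = 0.1, passes
]
kept = filter_snps(toy)
print(kept)
```

In this toy matrix only the first and last SNPs are retained; the second fails the MAF threshold and the third fails the call-rate threshold.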

Population structure and diversity analysis of cassava genotypes

Population structure and admixed ancestry were estimated using the model-based clustering method implemented in ADMIXTURE (Alexander et al. 2009). To determine the most likely number of populations, a ten-fold cross-validation (CV) procedure was run in ADMIXTURE for K = 1 to K = 10, and the K value with the lowest CV error was selected as the optimal number of sub-populations. Principal component analysis (PCA) was performed in the dartR package of R (Gruber et al. 2018), and the first two principal components were plotted, colored by the sub-populations determined by ADMIXTURE, to visualize population stratification.
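The selection rule for K can be sketched as follows. The example assumes the cross-validation errors have been collected from ADMIXTURE log lines of the form `CV error (K=5): 0.48210`; the helper name `best_k` and the error values in the toy log are hypothetical and are not results from this study.

```python
import re

# Pattern for lines such as "CV error (K=5): 0.48210" in an ADMIXTURE log.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_text: str) -> int:
    """Return the K with the lowest cross-validation error in the log text."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    return min(errors, key=errors.get)


# Hypothetical log excerpt; the numbers are illustrative only.
toy_log = """\
CV error (K=1): 0.61200
CV error (K=2): 0.55310
CV error (K=3): 0.52040
CV error (K=4): 0.50170
CV error (K=5): 0.48210
CV error (K=6): 0.48950
CV error (K=7): 0.49730
"""
print(best_k(toy_log))
```

Picking the single minimum is the rule stated in the text; in practice a flat CV curve may justify inspecting neighboring K values as well.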

The genetic diversity indices, including observed heterozygosity (Ho), expected heterozygosity (He), and fixation index (Fst), were calculated for the sub-populations using the adegenet package in R (Jombart 2008). Analysis of molecular variance (AMOVA) was performed using the poppr package in R (Kamvar et al. 2014). The genetic differentiation among the sub-populations identified in the population structure analysis was assessed with Nei’s pairwise fixation indices (Fst), computed in the hierfstat package in R (Goudet 2005).
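For readers unfamiliar with these indices, the sketch below shows one common way to compute Ho, He, and unbiased He for biallelic SNPs in a single sub-population. It is an illustration under stated assumptions, not the adegenet implementation: genotypes are coded 0/1/2 with `None` for missing calls, He is taken as 2pq per locus, the unbiased estimate applies the 2n/(2n - 1) correction, and values are averaged over loci. All names and data are hypothetical.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # assumed coding: 0, 1, 2 alternate-allele copies; None = missing


def locus_stats(calls: List[Genotype]) -> Tuple[float, float, float]:
    """Return (Ho, He, uHe) for one biallelic locus in one sub-population."""
    observed = [c for c in calls if c is not None]
    n = len(observed)
    p = sum(observed) / (2 * n)               # alternate-allele frequency
    ho = sum(c == 1 for c in observed) / n    # share of heterozygous calls
    he = 2 * p * (1 - p)                      # expected under Hardy-Weinberg
    uhe = he * (2 * n) / (2 * n - 1)          # small-sample correction
    return ho, he, uhe


def population_means(matrix: List[List[Genotype]]) -> Tuple[float, float, float]:
    """Average Ho, He, and uHe over all loci that have at least one call."""
    stats = [locus_stats(row) for row in matrix
             if any(c is not None for c in row)]
    k = len(stats)
    return (sum(s[0] for s in stats) / k,
            sum(s[1] for s in stats) / k,
            sum(s[2] for s in stats) / k)


# Toy sub-population (hypothetical): two loci, four individuals.
toy_pop = [
    [0, 1, 1, 2],
    [0, 0, 1, 1],
]
mean_ho, mean_he, mean_uhe = population_means(toy_pop)
print(round(mean_ho, 4), round(mean_he, 4), round(mean_uhe, 4))
```

When Ho exceeds He, as in this toy case, the sub-population carries more heterozygotes than expected under random mating.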

Phylogenetic analysis

A neighbor-joining phylogenetic tree showing the different sub-populations from ADMIXTURE was constructed based on Nei’s pairwise fixation indices (Fst) generated with the hierfstat package in R (Goudet 2005). The relationship among individuals was examined by generating a Euclidean distance matrix in R (R Core Team [R Foundation for Statistical Computing], 2021), which was then subjected to hierarchical clustering with the Unweighted Pair-Group Method with Arithmetic Means (UPGMA). The resultant phylogenetic tree was exported in Newick format using the ape package in R (Paradis et al. 2004) for visualization and annotation in the Interactive Tree of Life (iTOL) version 6.8 (https://itol.embl.de/, accessed on 5th September, 2023) (Letunic & Bork 2016).
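The individual-level clustering step (Euclidean distances, UPGMA, Newick export) can be sketched in plain Python. This is a didactic illustration only, not the R workflow used in the study; it assumes complete 0/1/2 genotype vectors, does not cover the neighbor-joining tree, and the function names and genotype labels are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two genotype score vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def upgma_newick(profiles: Dict[str, List[float]]) -> str:
    """Cluster genotypes with UPGMA and return the tree as a Newick string."""
    vectors = list(profiles.values())
    # Active clusters: id -> (Newick sub-tree, number of leaves, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(profiles)}
    dist = {frozenset((i, j)): euclidean(vectors[i], vectors[j])
            for i, j in combinations(clusters, 2)}
    next_id = len(clusters)
    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        pair = min((frozenset(p) for p in combinations(clusters, 2)),
                   key=dist.__getitem__)
        i, j = sorted(pair)
        sub_i, n_i, h_i = clusters.pop(i)
        sub_j, n_j, h_j = clusters.pop(j)
        height = dist[pair] / 2
        # Distance to the new cluster is the size-weighted average (UPGMA).
        for k in clusters:
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            dist[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        newick = f"({sub_i}:{height - h_i:.3f},{sub_j}:{height - h_j:.3f})"
        clusters[next_id] = (newick, n_i + n_j, height)
        next_id += 1
    root, _, _ = next(iter(clusters.values()))
    return root + ";"


# Toy genotype profiles (hypothetical labels and scores).
toy_profiles = {
    "g1": [0, 0, 0, 0],
    "g2": [0, 0, 0, 1],
    "g3": [2, 2, 2, 2],
    "g4": [2, 2, 1, 2],
}
tree = upgma_newick(toy_profiles)
print(tree)
```

The returned string can be pasted into any Newick-compatible viewer; the two tight pairs (g1, g2) and (g3, g4) join first and are then merged at the root.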

Results

Characterization of SNP markers

In this study, single nucleotide polymorphisms (SNPs) within the cassava (M. esculenta) genome were analyzed using the reference genome v8.1. Initially, a total of 12,841 SNPs were identified; after applying stringent filtration criteria (retaining markers with a minor allele frequency > 0.05 and a call rate > 80%), 5247 SNPs were retained for further investigation. The outcome of this filtration process is shown in Fig. 1; the average call rate was 96%, with a narrow range of 92% to 98% (Fig. 1A).

Fig. 1 SNP marker characteristics. A Call rate of the SNP markers. B Distribution of SNP markers across the 18 chromosomes of cassava (Manihot esculenta). C Polymorphic information content (PIC) range values of the 5247 SNP markers

The distribution of SNPs across chromosomes was analyzed: chromosome 1 contained the highest number of SNPs (477), while chromosome 18 had the lowest (200). The average was 292 SNPs per chromosome (Fig. 1B).

Furthermore, the 5247 retained SNPs exhibited an average polymorphic information content (PIC) value of 0.4, indicating a moderate level of diversity. The PIC values ranged from 0.10 to 0.5, highlighting the variability in informativeness among the identified SNPs (Fig. 1C).
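The text does not state which PIC formula underlies these values, so the following comparison is an assumption made for illustration. For a biallelic SNP, the simplified form 1 - (p² + q²) peaks at 0.5, which would be consistent with the reported upper bound, whereas the Botstein-style form with the additional 2p²q² term peaks at 0.375. The function names are hypothetical.

```python
def pic_simplified(p: float) -> float:
    """Simplified PIC for a biallelic marker: 1 - (p^2 + q^2)."""
    q = 1.0 - p
    return 1.0 - (p * p + q * q)


def pic_botstein(p: float) -> float:
    """Botstein-style PIC for a biallelic marker: 1 - p^2 - q^2 - 2 p^2 q^2."""
    q = 1.0 - p
    return 1.0 - p * p - q * q - 2.0 * p * p * q * q


# Compare both forms at a few allele frequencies.
for freq in (0.05, 0.25, 0.50):
    print(freq, round(pic_simplified(freq), 4), round(pic_botstein(freq), 4))
```

Under either form, markers with allele frequencies near 0.5 are the most informative, which is why a concentration of PIC values near the upper bound indicates well-balanced alleles in the panel.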

Population structure analysis of the cassava genotypes

Population structure analysis based on the 5247 SNPs (MAF > 0.05 and call rate > 80%) identified five sub-populations across the 155 cassava genotypes (Fig. 2). The five sub-populations (pop1, pop2, pop3, pop4, and pop5) were defined by K = 5, which showed the lowest cross-validation error in ADMIXTURE (Fig. 2A). Pop1 (10 genotypes) and pop2 (38 genotypes) were composed of materials from a cross between TME14, a variety from Ibadan, Nigeria, and the Ugandan genotype MM060128. Most of the samples clustered in pop3 (75 genotypes) were from a cross between the Colombian variety COL40 and TME14. Pop4 (18 genotypes) was likewise composed of materials from the cross between MM060128 and TME14. The genotypes in pop5 (14 genotypes) were highly admixed and were also derived from the cross between the Ugandan variety MM060128 and the Ibadan variety TME14. A significant portion of the genetic composition of individuals in pop5 was derived from pop2. Additionally, the genetic profile of pop3 was primarily derived from pop5.

Fig. 2 Population structure analysis of 155 cassava genotypes based on 5247 genome-wide SNP markers with K = 5. A Hierarchical organization of the genetic relatedness of the 155 cassava genotypes. Each bar represents a single genotype, and the colored segments within each bar represent the proportional contribution of each sub-population to that accession. B A biplot of the first two principal components with colors based on the 5 sub-populations. C Neighbor-joining tree showing the genetic differentiation among the five sub-populations from ADMIXTURE analysis

Genotypic relationships were further investigated with principal component analysis (PCA), as shown in Fig. 2B. The first two principal components, PC1 and PC2, together accounted for approximately 24.2% of the total genetic variation. The biplot of PC1 and PC2 showed a discernible clustering pattern among the samples that mirrored the structure inferred by ADMIXTURE. This agreement between the two analyses supports the reliability of the findings.

To further understand the genetic relationship among the sub-populations, a neighbor-joining tree based on Nei’s pairwise fixation indices (Fst) was constructed (Fig. 2C). The analysis identified three major groups within the studied population. Group 1 comprised pop1, pop3, and pop5, suggesting a close genetic affinity among these sub-populations. In contrast, pop2 and pop4 each formed a distinct cluster, indicating a genetic distinctiveness that sets them apart from Group 1. This tree-based approach provided a complementary perspective on the genetic structure of the studied populations.

The genetic relationship among the individuals in the five sub-populations was determined by hierarchical clustering of the Euclidean distance matrix using the UPGMA method. The analysis clustered individuals from the five sub-populations into three distinct major clades (Fig. 3): the first clade comprised pop1, pop3, and pop5; the second clade comprised pop2; and the third clade comprised pop4. This clustering was similar to that of the neighbor-joining phylogenetic tree in Fig. 2C.

Fig. 3 Phylogenetic tree for 155 cassava genotypes based on 5247 SNP markers. The color of each of the five sub-populations is based on the ADMIXTURE results. The individuals clustered into three major groups (I, II, and III)

Genetic diversity indices

The mean values for expected (He), observed (Ho), and unbiased expected heterozygosity (uHe) were 0.30, 0.32, and 0.31, respectively (Table 1). Gene diversity, represented by He, ranged from 0.28 in pop1 to 0.31 in pop2. Ho ranged from 0.30 (pop4) to 0.33 (pop2, pop3, and pop5).

Table 1 Genetic diversity indices of the 5 sub-populations based on SNP markers

Pairwise fixation indices were used to estimate the genetic differentiation among the five sub-populations (Table 2). Pop3 and pop4 showed the greatest differentiation (Fst = 0.23), while the pairs pop1 and pop5, and pop2 and pop4, showed the least (Fst = 0.12).

Table 2 Sub-population’s pairwise genetic differentiation index (Fst)
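To make the pairwise index concrete, the sketch below computes a simplified Nei-style Fst, (Ht - Hs) / Ht, from allele frequencies in two sub-populations, summing Hs and Ht over loci before taking the ratio. It is an assumption-laden illustration: it omits the sample-size corrections applied by dedicated packages such as hierfstat, so its values would not match Table 2 exactly. Genotypes are assumed to be coded 0/1/2 with `None` for missing calls, and all names and data are hypothetical.

```python
from typing import List, Optional

Genotype = Optional[int]  # assumed coding: 0, 1, 2 alternate-allele copies; None = missing


def alt_freq(calls: List[Genotype]) -> float:
    """Alternate-allele frequency at one locus, ignoring missing calls."""
    observed = [c for c in calls if c is not None]
    return sum(observed) / (2 * len(observed))


def pairwise_fst(pop_a: List[List[Genotype]],
                 pop_b: List[List[Genotype]]) -> float:
    """Simplified Nei-style Fst between two sub-populations (rows = loci)."""
    hs_total = 0.0  # mean within-population expected heterozygosity
    ht_total = 0.0  # expected heterozygosity of the pooled allele frequency
    for calls_a, calls_b in zip(pop_a, pop_b):
        p_a, p_b = alt_freq(calls_a), alt_freq(calls_b)
        hs_total += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        p_bar = (p_a + p_b) / 2
        ht_total += 2 * p_bar * (1 - p_bar)
    return (ht_total - hs_total) / ht_total if ht_total > 0 else 0.0


# Toy sub-populations (hypothetical): two loci, three individuals each.
same_a = [[0, 1, 2], [1, 1, 0]]
same_b = [[2, 1, 0], [0, 1, 1]]      # same allele frequencies as same_a
fixed_a = [[0, 0, 0], [0, 0, 0]]
fixed_b = [[2, 2, 2], [2, 2, 2]]     # fixed for the opposite allele

print(pairwise_fst(same_a, same_b), pairwise_fst(fixed_a, fixed_b))
```

The two toy comparisons mark the extremes of the scale: identical allele frequencies give 0, and sub-populations fixed for opposite alleles give 1.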

Analysis of molecular variance (AMOVA) showed that the overall fixation index of the entire population was 0.13 (Table 3). The variation between sub-populations was ~34%, while variation within sub-populations was 66%. This finding indicates a higher level of diversity within sub-populations than between them, reflecting the presence of various genetic variants and alleles within the accessions studied.

Table 3 Analysis of molecular variance of all genotypes using 5247 SNPs as markers

Discussion

In this study, a substantial number of markers (5247 SNPs) were generated from a diverse collection of cassava genotypes; these markers exhibited moderate polymorphism and revealed a moderate level of genetic diversity. The number of markers differed from those reported in earlier studies. Soro et al. (2023) and Ferguson et al. (2019b) obtained fewer SNPs: 36 SNPs from 184 cassava genotypes and 1124 SNPs from 522 genotypes, respectively. In contrast, Ogbonna et al. (2021) and Rabbi et al. (2017) obtained more: 27,045 SNPs from 3354 genotypes and 72,279 genome-wide SNP markers from 672 cassava accessions, respectively. These findings suggest that the number and distribution of markers across the chromosomes were linked to the level of genetic diversity (Ogbonna et al. 2021). The level of genetic diversity, as indicated by the number and distribution of SNP markers, directly influences the presence and distribution of disease resistance genes within cassava populations. Regions of the cassava genome that harbor a higher density of SNPs often correspond to regions containing genes associated with disease resistance (Huang & Han 2014). This relationship enables breeders to identify and utilize diverse genetic resources for developing resistant varieties. It also emphasizes the importance of using SNP markers within crop breeding programs to identify and utilize genetic variants associated with resistance to CBSD and CMD, the major diseases threatening cassava.

The average PIC value of the SNP markers in this study indicates a moderate level of informativeness, with approximately 65% of the markers falling within the PIC range of 0.4–0.5. This level is higher than in previous studies in cassava, where PIC values ranged from 0.18 to 0.26 (de Oliveira et al. 2014; Ogbonna et al. 2021; de Albuquerque et al. 2018). However, similar results were observed by Eltaher et al. (2018) with SNP markers. The higher PIC values of the SNP markers used in this study suggest that they are suitable for conducting genetic studies in cassava accessions. The PIC reflects the level of diversity captured by the markers, with higher values (0.4–0.5) indicating greater variability among the studied genotypes. In addressing the CBSD and CMD challenges, using SNPs with higher PIC values facilitates effective breeding strategies for developing disease-resistant varieties.

The population structure analysis identified five distinct sub-populations within the cassava panel, indicating significant differentiation among genotypes. The composition of each sub-population was strongly associated with the geographical origin of the materials and/or pedigree, given that all genotypes originated from biparental populations whose parents hailed from West Africa, East Africa, and South America. This kind of genetic divergence based on crop origin has been previously documented (Adjebeng-Danquah et al. 2020; Adu et al. 2021). Principal component analysis (PCA) further confirmed the genetic relationships among accessions in the diversity panel. The clustering of accessions in the PCA biplot closely mirrored the results of structure analysis using ADMIXTURE, consistent with findings from previous studies (Adu et al. 2021; Soro et al. 2023). However, the total genetic variation explained by the first two principal components (PCs) was lower in this study (~24.2%) than the 38.2% and 48.13% reported by Adu et al. (2021) and Soro et al. (2023), respectively. This discrepancy may be attributed to the narrower genetic background of a panel derived from biparental populations. Understanding population structure helps in selecting parental lines for hybridization, enhancing genetic recombination and introducing novel combinations of resistance alleles, ultimately broadening the genetic base and enhancing the resilience of new varieties to disease pressure (Uba et al. 2021). In the current study, the population structure of cassava genotypes based on SNP data reveals clusters of genotypes with similar genetic backgrounds. Certain genetic clusters may exhibit higher levels of resistance to cassava mosaic disease (CMD) and cassava brown streak disease (CBSD) due to shared genetic traits. Breeders can leverage this information strategically to incorporate disease-resistance traits into new varieties, thereby improving the overall sustainability and productivity of cassava cultivation.

The NJ tree and hierarchical clustering grouped the 155 cassava genotypes into three major groups, revealing a shared gene pool within each cluster; Cluster 1 comprised pop1, pop3, and pop5, whose cassava genotypes shared common parents. Additionally, the average observed heterozygosity (Ho) was 0.32, in line with similar studies, which reported Ho of 0.33 (Ferguson et al. 2019b) and 0.32 (Ogbonna et al. 2021). Analysis of heterozygosity within the distinct sub-populations revealed considerable diversity, suggesting potential hybridization and selective processes among the genotypes. This diversity underscores the adaptability and resilience of the species to environmental changes, diseases, and pests, as proposed by Goulet et al. (2017). Such genetic diversity serves as a valuable resource for breeders, offering opportunities to develop new varieties with enhanced traits and increased resistance to various stresses.

The differentiation among the groups was assessed using the Fst index, which estimates genetic differentiation among groups. According to Wright (1968), Fst values are categorized as high (> 0.25), moderate (0.15–0.25), and low (< 0.05). This index serves as an indirect estimate of gene flow between sub-populations and, when considered alongside heterozygosity, helps in interpreting genetic differentiation. In this study, moderate Fst values (0.15–0.25) between certain population pairs (e.g., pop1 and pop3, pop2 and pop3, pop3 and pop4) indicate substantial genetic differentiation. These findings suggest the presence of distinct genetic clusters within cassava populations, which may harbor unique genetic variations, including traits related to disease resistance against CBSD and CMD. Moderate Fst values between population pairs highlight opportunities for targeted breeding efforts. Populations exhibiting moderate genetic differentiation can serve as potential sources of diverse genetic material for developing disease-resistant cassava varieties. The observed Fst values reflect the influence of geographical isolation and parental lineage on genetic differentiation. For instance, populations originating from geographically isolated regions (e.g., Africa and South America) show higher genetic differentiation (moderate Fst), indicating limited gene flow and potential genetic adaptation to local environments. Conversely, populations derived from the same parents (e.g., pop2 and pop5) exhibit lower differentiation (low Fst), suggesting recent common ancestry and potential sharing of genetic traits, including disease resistance genes. Breeders can leverage populations with moderate genetic differentiation to introduce novel genetic combinations and enhance disease resistance traits in new cassava varieties.

Conclusion

The SNP markers identified in this study exhibit high polymorphism and informativeness rendering them valuable for assessing diversity within cassava populations. These markers effectively classified cassava populations into sub-groups based on their allele variants. The moderate to high genetic diversity observed across the cassava populations suggests the presence of valuable alleles associated with desirable traits. Integrating these populations into cassava improvement efforts could facilitate the development of user-friendly breeding programs for farmers and stakeholders across the value chain. This research contributes to germplasm management and the establishment of core collections aimed at improving existing varieties.

In summary, the detailed distribution of SNPs across chromosomes and the assessment of polymorphic information content (PIC) enhance our understanding of the genetic makeup of cassava. This groundwork sets the stage for future investigations into cassava diversity and its potential applications in breeding programs and genetic studies. The markers successfully differentiated cassava populations into distinct sub-groups based on their genetic variants. The moderate to high genetic diversity observed among genotypes underscores the presence of valuable alleles for desirable traits. The integration of these genotypes into cassava improvement efforts holds promise for developing enhanced varieties through breeding programs.