Introduction

Cassava (Manihot esculenta Crantz) is an economically important, staple tropical root crop. It is one of the most important dietary energy sources in most tropical countries (Ceballos et al. 2020). Cassava is robust, resistant to droughts and floods, and grows well in degraded soils. Historically, cassava was grown for human consumption (e.g., gari, fufu, sago, table consumption, etc.). However, demand from worldwide markets for cassava (e.g., as a starch, animal feed, or bioethanol) has strengthened over the years. Cassava is now the second most important source of starch worldwide (Stapleton 2012). The global production of fresh root equivalents in 2018 amounted to 277.1 million tons (FAO 2018). Currently, there are important breeding efforts in Brazil, Colombia, China, Ghana, India, Nigeria, Kenya, Mozambique, Tanzania, Thailand, Uganda, and Vietnam (Ceballos et al. 2020).

Since the cassava breeding cycle is long, with a low reproduction rate, it takes 7–8 years to develop a new cassava variety (Rojanaridpiched et al. 2010). The first cassava evaluations occur in seedlings, single row trials (SRT), followed by preliminary (PYT) and advanced (AYT) yield trials, and culminating in multi-location, multi-year uniform yield trials (UYT; Rojanaridpiched et al. 2010; Gracen et al. 2018; Ceballos et al. 2020). Genetic gains from cassava breeding efforts have been very limited in the last century compared to other crops (Ceballos et al. 2004). Factors limiting cassava’s breeding efficiency include its heterozygous progenitors, long breeding cycles, clonal propagation, and non-recovery of recurrent genomes after single trait introgressions (Ceballos et al. 2016; Kuon et al. 2019). Although cassava is predominantly clonally propagated, it outcrosses with plants still capable of sexual reproduction. The intentional or unintentional inclusion of seedlings into clonally propagated stock continually generates new genotypes within a population, thus increasing haplotypic variation (McKey et al. 2010). Storage root yield is the key objective in cassava breeding programs. Farmers attach the highest importance to good storage root yields with high dry matter content (DMC) in the wild-type and waxy (amylose-free) cassava starch widely used in food and industrial applications (Sanchez et al. 2010; Aiemnaka et al. 2012). Cassava starch has many remarkable characteristics that are advantageous for industrial applications, including high paste viscosity and clarity and high freeze–thaw stability (Nwokocha et al. 2009; Sanchez et al. 2010; Toae et al. 2019).

Genome-wide association study (GWAS) is a powerful approach based on the linkage disequilibrium (LD) resulting from the association of target traits and haplotype loci. GWAS identifies genes, alleles, or haplotypes related to specific agronomic traits in complex environments and has been successfully used in many plant studies, including studies of wheat (Juliana et al. 2018), maize (Zhang et al. 2015), rice (Begum et al. 2015), and tomatoes (Sauvage et al. 2014). In cassava, GWAS has been successfully used to dissect the genetic basis of several traits (Rabbi et al. 2020), including dry matter content, total carotenoid content (Rabbi et al. 2017; Ikeogu et al. 2019), resistance to CMD (CMD2), phytoene synthase (PSY), root number, shoot weight, harvest index (Rabbi et al. 2017; Okeke et al. 2017; Zhang et al. 2018; Somo et al. 2020; Yonis et al. 2020), cassava green mite resistance (Ezenwaka et al. 2018), cassava brown streak disease resistance (Kayondo et al. 2018), cassava mosaic disease resistance (Wolfe et al. 2016), and cyanide content (Ogbonna et al. 2021).

Genomic selection (GS) has been proposed as a promising tool to overcome breeding challenges (Meuwissen et al. 2001). GS is an alternative to traditional marker-assisted selection (MAS) for quantitative traits (Jannink et al. 2010; Hickey et al. 2017). GS aims to combine genome-wide molecular markers and phenotypes in a training population to predict the genetic values of future individuals for selection purposes. Since no significance test is required, eliminating biases in marker effect estimates, the breeding cycle is accelerated (Desta and Ortiz 2014). Contrary to MAS, GS is suitable for quantitative traits controlled by a large number of small-effect genes. GS has been introduced in many aspects of plant breeding, such as inbred performance predictions and hybrid predictions (Riedelsheimer et al. 2012; Crossa et al. 2014; Xu et al. 2014; Wang et al. 2017; Xu 2017). Accurate predictions are essential for the successful application of GS. The predictability (i.e., the prediction accuracy) obtained from cross-validation in training populations has been previously evaluated in maize, wheat, barley, and cassava (VanRaden 2008; Crossa et al. 2017; Wolfe et al. 2017; Andrade et al. 2019). These studies indicated that predictability is affected by various genetic factors, including heritability, relatedness, sample size, marker density, and genetic architecture. The predictabilities of low heritability traits, such as yield, were consistently lower than high heritability traits, such as kernel weight and plant height (Xu et al. 2018).

In addition to genetic factors, statistical factors influence predictability. Parametric methods widely used in GS to predict genetic values include genomic best linear unbiased prediction (GBLUP; VanRaden 2008), the least absolute shrinkage and selection operator (LASSO; Tibshirani 1996), and partial least squares (PLS; Gelandi and Kowalski 1986). Nonparametric methods include random forest (Svetnik et al. 2003), neural networks (NN; Ehret et al. 2015), support vector machines (SVM; Maenhout et al. 2007), and reproducing kernel Hilbert spaces (RKHS) regressions (de Los Campos et al. 2010). Several investigators have compared the predictive performance of these methods using simulation and empirical data (Heslot et al. 2012; VanRaden 2008; Riedelsheimer et al. 2012; Howard et al. 2014; Wang et al. 2015). In cassava, the key limitations of GS are the use of heterozygous progenitors, which distinguishes cassava from most of the crops in which GS has proven useful (Ceballos et al. 2016, 2020), and the strong nonadditive genetic effects on fresh root yield reported by Wolfe and co-workers (2017, 2019).

The objectives of this study were to (1) understand the genetic architecture of yield-related traits and starch pasting property traits in a cassava breeding population and identify novel QTLs associated with these traits using GWAS, and (2) explore the potential utility of GS for future cassava breeding programs.

Materials and methods

Plant materials and field trials

We obtained 276 cassava genotypes, comprising 247 wild-type and 29 waxy cassava starch types, from the germplasm collection developed at the Department of Agriculture, Kasetsart University, Bangkok, Thailand, and the Thai Tapioca Development Institute (TTDI). The genotypes originated from the International Center for Tropical Agriculture (CIAT), the International Institute of Tropical Agriculture (IITA), and other breeding programs (Supplementary Table S1).

Each clone was represented by 10 plants in a single row, with plants spaced 1 m × 1 m apart. Two commercial varieties, KU50 and HB80, were planted after every ten rows in incomplete blocks for 2 consecutive years (2015 and 2016; experimental stations: Tapioca Development Institute (TDI) station (15.1577245, 101.5028433; N15°9.46347, E101°30.170598) and Nakhon Ratchasima Province, Thailand). The plots were treated with N–P–K fertilizer (15–7–18) at 312.5 kg ha⁻¹. Weeds were controlled, primarily by hand, 1 month after planting. The plants were planted in March or April, before the rainy season, and harvested 10 to 12 months later in the following year.

Phenotypic evaluation and statistical analysis

The total fresh root weight (FRW) of the storage roots harvested per row was measured in kilograms. The starch content (SC; %) was measured with a Riemann balance: 5 kg of bulked roots for each genotype from a single-row plot was weighed in air and then in water (Kawano et al. 1987). The harvest index (HI) was computed as the ratio of total fresh root weight to total biomass at harvest. The average number of roots per plant (RP) was computed as the total number of roots in the plot divided by the number of harvested plants. A mixed model for an augmented design with unreplicated entries was used in this study.

The model was

$$Y_{ij} = \mu + \beta_i + \tau_j + \varepsilon_{ij}$$

where Y is the observed phenotype, μ is the overall mean, β and τ are the effects of blocks and entries, respectively, and ε is the residual error. Broad-sense heritability (H²) was estimated using BLUPs and the formula:

$$H^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2 / r}$$

where σ²g is the genotype variance, σ²e is the residual variance, and r is the number of environments. The variance components were calculated using the lmer function in the R package lme4 (v1.1–7; Bates et al. 2015). For the combined analysis across years and locations, variance components were estimated using the REML method as described above, and the resulting estimates for each genotype were used in the GWAS and GS analyses. The genetic correlation between traits was estimated using best linear unbiased predictions (BLUPs) of the measured traits. Pearson correlations were computed and visualized using the ggcorrplot package in R.
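As a worked illustration of the heritability formula above, the short Python sketch below computes H² from variance components. The function name and the numeric inputs are hypothetical placeholders, not estimates from this study; the study itself obtained its variance components with lmer in R.

```python
def broad_sense_heritability(var_g: float, var_e: float, n_env: int) -> float:
    """Return H^2 = var_g / (var_g + var_e / r), where r is the number of environments."""
    if n_env < 1:
        raise ValueError("n_env must be at least 1")
    denominator = var_g + var_e / n_env
    if denominator <= 0:
        raise ValueError("variance components must give a positive phenotypic variance")
    return var_g / denominator


# Hypothetical variance components, for illustration only.
h2 = broad_sense_heritability(var_g=2.0, var_e=4.0, n_env=2)
print(round(h2, 2))  # 0.5
```

With the residual variance divided by the number of environments, adding environments raises H² for the same variance components.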

Starch pasting properties

Starch extraction

Starch was extracted following Chaengsee et al. (2020). Washed fresh cassava roots were peeled, the upper and lower ends were cut off, and the roots were chopped into small pieces before being crushed in water at a cassava-to-water ratio of 1:2. The crushed paste was separated from the water by passing it successively through a cloth and a 90 μm screen. The paste was washed in water before being oven-dried at 50 °C for 24 h. The dried starch was milled and screened through a 90 μm screen before further analysis.

RVA profile parameters

Viscosity profiles of wild-type and waxy cassava starch dispersions were analyzed with a Rapid Visco Analyzer (model RVA-4 Series, Newport Scientific, Warriewood, Australia) according to the method of (1995). Starch samples (2.58 g, dry basis) were dispersed in distilled water to prepare a 9.21% starch suspension with a total weight of 28 g. Viscosity was recorded under the following temperature profile (Standard program No. 1): the samples were held at 50 °C for 1 min, heated from 50 to 95 °C at 12 °C/min, and maintained at 95 °C for 2.5 min; the hot paste was then cooled to 50 °C at 12 °C/min and held for 2 min. The suspension was stirred constantly at 160 rpm, and the following parameters were recorded: peak viscosity (PV), trough viscosity (TV), breakdown viscosity (BD), final viscosity (FV), setback from trough (SB), pasting temperature (PT), and peak temperature (PeT). The phenotypic distributions and starch pasting properties of the wild-type and waxy cassava were visualized with the R package “ggplot2” (Wickham, 2016).
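The arithmetic behind the RVA settings above can be checked with a small Python sketch. It reproduces the 9.21% suspension concentration and the total run time implied by the temperature profile. The relations BD = PV − TV and SB = FV − TV are the conventional RVA definitions and are assumed here, since the text names the parameters without giving formulas; the viscosity values passed to `derived_viscosities` are hypothetical.

```python
def suspension_concentration(starch_g: float, total_g: float) -> float:
    """Starch concentration of the suspension in percent (w/w)."""
    return 100.0 * starch_g / total_g


def profile_minutes(start_c: float = 50.0, peak_c: float = 95.0,
                    rate_c_per_min: float = 12.0, initial_hold: float = 1.0,
                    peak_hold: float = 2.5, final_hold: float = 2.0) -> float:
    """Total run time: hold, heat, hold, cool, hold."""
    ramp = (peak_c - start_c) / rate_c_per_min  # heating and cooling take equally long
    return initial_hold + ramp + peak_hold + ramp + final_hold


def derived_viscosities(pv: float, tv: float, fv: float) -> dict:
    """Breakdown and setback from trough (assumed conventional definitions)."""
    return {"BD": pv - tv, "SB": fv - tv}


print(round(suspension_concentration(2.58, 28.0), 2))  # 9.21
print(profile_minutes())  # 13.0
# Hypothetical viscosities in RVU, for illustration only.
print(derived_viscosities(pv=400.0, tv=150.0, fv=220.0))  # {'BD': 250.0, 'SB': 70.0}
```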

Genotyping‐by‐sequencing

DNA extraction and genotyping-by-sequencing

Genomic DNA was extracted using the CTAB method of Doyle & Doyle (1990) with minor modifications. Briefly, the chloroform:isoamyl alcohol (24:1) extraction step was applied twice to improve the removal of phenolic compounds. The DNA samples were quantified using the NanoDrop D-1000 per the manufacturer’s instructions, diluted to 20 ng/μl, and used for GBS library preparation. GBS library construction followed Elshire et al. (2011). Library preparation and sequencing were performed at the Institute of Genomic Diversity (Cornell University, Ithaca, NY, USA), with 101‐cycle single‐end sequencing on one lane of a 16‐lane flow cell on an Illumina HiSeq 2000 (Illumina).

Sequence alignment, SNP calling, and SNP imputation

The sequencing reads were processed with the GBS Discovery Pipeline for species with a reference genome implemented in TASSEL version 3.0 (Bradbury et al. 2007) following pipeline documentation (Glaubitz et al. 2014). In conjunction with the barcode adapter key file, the pipeline identified high-quality, unique sequence reads (termed tags), which contained a barcode, a cut site, and an inserted genomic sequence. The pipeline then merged these tags, indexed them, and aligned them to the reference genome. The sequence tags for our GBS library were aligned to the version 1.0 release of the cassava V6 reference genome (Bredeson et al. 2016) by implementing the Burrows–Wheeler alignment (Li and Durbin 2009) command of the pipeline. The TASSEL 3.0 Discovery SNP Caller (Glaubitz et al. 2014) was used to align the multiple sequence tags from the same physical locations across the genome, to call SNPs at these locations across the individual samples, and to output this data into one HapMap format file (.hmp.txt) per chromosome for downstream analysis. Missing SNP data were imputed with the FILLIN algorithm (Swarts et al. 2014) in TASSEL version 5.0 (Bradbury et al. 2007).

Genome‐wide association study

GWAS was performed using trait BLUPs for 276 genotypes. To discover associations between the genome‐wide GBS SNPs and trait phenotypes in our germplasm, we used the compressed mixed linear model (CMLM; Zhang et al. 2010), the multi‐locus mixed model (MLMM; Segura et al. 2012), and the fixed and random model circulating probability unification (FarmCPU) method (Liu et al. 2016). The Bayesian-information and linkage-disequilibrium iteratively nested keyway (BLINK) method, an enhanced version of FarmCPU, was employed to evaluate the dichotomous trait of waxy versus wild-type starch phenotype. All models were implemented in the Genome Association and Prediction Integrated Tool (GAPIT) R package (Lipka et al. 2012) in R version 3.4.2 (R Core Team 2017). The percentage of phenotypic variance explained (PVE) by significant SNPs was calculated as the adjusted R² obtained with the lm function in R.
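To make the PVE calculation concrete, the sketch below computes the adjusted R² of a single-SNP linear regression in plain Python. It is an illustrative analogue of fitting the phenotype on one marker with lm in R, not the code used in the study. The function name, the 0/1/2 allele-dosage coding, and the example data are assumptions.

```python
def adjusted_r2_single_snp(dosages, phenotypes):
    """Adjusted R^2 of the regression phenotype ~ SNP dosage (one predictor)."""
    n = len(phenotypes)
    if n != len(dosages) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    if sxx == 0 or syy == 0:
        raise ValueError("monomorphic SNP or constant phenotype")
    r2 = sxy ** 2 / (sxx * syy)  # R^2 of a simple linear regression
    # Adjust for one predictor: 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p = 1.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - 2)


# Hypothetical dosages (0, 1, 2 copies of an allele) and trait BLUPs.
dosages = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
blups = [1.1, 0.9, 2.2, 1.8, 3.1, 2.9, 1.0, 2.1, 3.0, 1.9]
print(f"PVE = {100 * adjusted_r2_single_snp(dosages, blups):.1f}%")
```

Adjusted R² can be negative when a marker explains essentially nothing, which is one reason small PVE values should be read with caution.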

We determined significant associations for each trait by adjusting the raw p value using the conservative Bonferroni error rate control method. The Bonferroni‐corrected threshold for the 89,943 GBS markers was set at a probability of 0.05/89,943 = 5.56 × 10⁻⁷ (equivalent to a − log10 (p value) of 6.25) and was used as the cut‐off for identifying significant SNP‐trait associations (Yang et al. 2014). Manhattan and Q–Q plots were visualized using the R package “CMplot” (Yin et al. 2021). The pairwise LD between the genome‐wide markers on each chromosome was calculated using TASSEL v5.2.20 (Bradbury et al. 2007). Marker pairs with statistically significant LD (p < 0.05) were considered in the LD analysis. The r² values were plotted against distance (Mb), and a LOESS smoothing line was fitted using the R package ggplot2 (Wickham, 2016). Narrow-sense heritability estimates were computed from the variance components estimated by the genomic mixed models fitted in TASSEL v5.2.20 (Bradbury et al. 2007), which compute an additive genetic relationship (i.e., kinship) matrix, as explained in Endelman and Jannink (2012).
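The Bonferroni cut-off described above follows directly from the marker count. The minimal Python sketch below reproduces the calculation; the function name is hypothetical, and the family-wise error rate of 0.05 and the 89,943 markers are taken from the text.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int):
    """Return the per-test p-value cut-off and its -log10 score."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    p_cutoff = alpha / n_tests
    return p_cutoff, -math.log10(p_cutoff)


p_cutoff, score = bonferroni_threshold(alpha=0.05, n_tests=89_943)
print(f"p cut-off = {p_cutoff:.2e}, -log10(p) = {score:.2f}")
```

A SNP is declared significant only when its − log10 (p value) exceeds this score, which is why markers with scores near 5 are reported in the Results as falling short of the threshold.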

Identification of candidate genes

We identified candidate genes within the associated genomic regions using the significant GWAS results and the BioMart tool of the Phytozome 12 portal (https://phytozome.jgi.doe.gov/biomart/; accessed 15 Oct. 2019). We searched for genes located within windows of ± 50 kb around the most significant SNPs.
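The ± 50 kb window search can be sketched as a simple interval-overlap filter. The Python example below is illustrative only: the `Gene` record, the gene identifiers, and the coordinates are hypothetical, and it assumes that a gene counts as a candidate when any part of it overlaps the window, a detail the text does not specify.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    gene_id: str
    chrom: str
    start: int  # 1-based start position (bp)
    end: int    # 1-based end position (bp), inclusive


def genes_near_snp(genes: List[Gene], chrom: str, pos: int,
                   flank: int = 50_000) -> List[str]:
    """Return IDs of genes overlapping the window [pos - flank, pos + flank]."""
    lo, hi = pos - flank, pos + flank
    return [g.gene_id for g in genes
            if g.chrom == chrom and g.start <= hi and g.end >= lo]


# Hypothetical annotation records, for illustration only.
annotation = [
    Gene("gene_A", "chr02", 940_000, 952_000),      # overlaps the window edge
    Gene("gene_B", "chr02", 1_000_500, 1_003_000),  # inside the window
    Gene("gene_C", "chr02", 1_060_000, 1_065_000),  # outside the window
    Gene("gene_D", "chr03", 1_000_000, 1_002_000),  # different chromosome
]
print(genes_near_snp(annotation, "chr02", 1_000_000))  # ['gene_A', 'gene_B']
```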

Genomic prediction methods

Evaluation of genomic prediction models using cross‐validation

Genomic prediction was evaluated using eight statistical models. The parametric models included ridge regression BLUP (rrBLUP), implemented in the rrBLUP package (Endelman 2011), BayesA (Meuwissen et al. 2001), BayesB (Meuwissen et al. 2001), and BayesC (Habier et al. 2011). Bayesian Lasso (BL; Park and Casella 2008) and Bayesian ridge regression (BRR; de los Campos et al. 2013) were executed using the BGLR R package (Pérez and de los Campos 2014). Semi-parametric models based on RKHS methods were run using the BGLR R package (de los Campos et al. 2009, 2010). Random forest classification (RFC; Breiman, 2001), a nonparametric machine learning method used for regression and classification, has been successfully used for predictions (Wolfe et al. 2017; González‐Camacho et al. 2018). We implemented RFC using the “randomForest” package in R (Liaw 2013).

Predictability or accuracy

Predictability for cassava performance was evaluated using tenfold cross-validation. The sample was randomly partitioned into 10 parts: nine parts were used to estimate parameters, and the remaining part was held out, predicted, and used to estimate accuracy. The random partitioning into training and validation sets was repeated 30 times, and the mean of the Pearson’s correlation coefficients was defined as the genomic prediction ability. Tukey’s honestly significant difference test, based on the Studentized range statistic (R Core Team 2017), was used to test for significant differences in the predictive abilities of the eight models.
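The repeated tenfold cross-validation scheme can be sketched in plain Python as follows. This is a schematic under stated assumptions, not the pipeline used in the study: the one-predictor linear model stands in for the eight genomic prediction models, the toy data are simulated, all function names are hypothetical, and the correlation is computed once per repeat over all held-out predictions (the text does not say whether correlations were computed per fold or per repeat).

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)


def kfold_indices(n, k, rng):
    """Randomly split indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def simple_linear_fit_predict(x_train, y_train, x_test):
    """Placeholder model: least-squares fit of y on a single predictor."""
    n = len(x_train)
    mx, my = sum(x_train) / n, sum(y_train) / n
    sxx = sum((v - mx) ** 2 for v in x_train)
    slope = sum((a - mx) * (b - my) for a, b in zip(x_train, y_train)) / sxx
    return [my + slope * (v - mx) for v in x_test]


def repeated_cv(x, y, fit_predict, k=10, repeats=30, seed=1):
    """Mean Pearson correlation between held-out predictions and observations."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        predicted = [0.0] * len(y)
        for fold in kfold_indices(len(y), k, rng):
            held_out = set(fold)
            train = [i for i in range(len(y)) if i not in held_out]
            preds = fit_predict([x[i] for i in train], [y[i] for i in train],
                                [x[i] for i in fold])
            for i, p in zip(fold, preds):
                predicted[i] = p
        accuracies.append(pearson(predicted, y))
    return sum(accuracies) / len(accuracies)


# Simulated toy data: one predictor plus noise (illustration only).
toy_rng = random.Random(42)
x_toy = [toy_rng.gauss(0.0, 1.0) for _ in range(60)]
y_toy = [2.0 * v + toy_rng.gauss(0.0, 0.5) for v in x_toy]
print(round(repeated_cv(x_toy, y_toy, simple_linear_fit_predict), 2))
```

Passing the model as a function keeps the partitioning and scoring identical across models, so differences in predictive ability reflect the models rather than the splits.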

Results

Phenotypic analysis

Figure 1 shows the phenotypic distributions of the yield-related traits. Table 1 provides descriptive statistics for fresh root weight, percentage of starch content, average number of roots, and harvest index. The fresh root weight varied widely in the population, ranging from 0.1 to 11.3 kg/plant. The percentage of starch content ranged from 3.8 to 30.1%, with an average of 16.4%. The harvest index ranged from 0.1 to 0.9, with an average of 0.54. The average number of roots ranged from 2 to 27, with an average of 11 roots per plant. The broad-sense heritability was moderate to low for harvest index, fresh root weight, root number per plant, and percentage of starch content (H² = 0.57, 0.43, 0.40, and 0.33, respectively). Thus, phenotypic variations in yield-related traits were affected more by environmental factors than by genetic factors. The pasting properties varied significantly among the different lines. The narrow-sense heritability ranged from 0.58 for SB to 0.85 for PV (Table 1). These results indicate that the phenotypic variations in the starch pasting properties were mainly affected by genetic factors and that this panel can be used for further genetic analyses.

Fig. 1

Frequency distribution of yield-related traits: number of roots per plant (RP), fresh root weight (FRW), harvest index (HI), and starch content (SC), and of starch pasting property traits: peak RVU, trough RVU, breakdown RVU, final viscosity RVU, setback from trough RVU, pasting temp, and peak temp. Each panel shows the distribution of the best linear unbiased estimates for the panel of 276 cassava genotypes used in this study

Table 1 Summary information for phenotypic data of the four yield-related traits and starch pasting properties

Correlations between the BLUPs of FRW and those of SC, HI, and RP were low, ranging from 0.09 to 0.2. SC was positively correlated with RP (r = 0.56, p < 0.001) (Fig. 2). FRW was not significantly correlated with the starch property traits except for starch pasting temperature (r = − 0.24, p < 0.001). The correlations between HI and the starch pasting property traits were positive, ranging from 0.48 to 0.57 (p < 0.001). There were moderate to high, significant positive correlations among the starch property traits, with values ranging from 0.39 to 1.0 (p < 0.001).

Fig. 2

Heatmap of genetic correlation for yield and starch properties-related traits using BLUP values

Starch pasting properties

The pasting characteristics and paste viscosity profiles of the starches at an analysis concentration of 9.21% (w/w) are given in Table 2 and Fig. 3a. The pasting temperatures of the wild-type cassava starches were similar to those of the waxy cassava starches (68.45–75.25 ℃ and 67.75–73.36 ℃, respectively). The peak viscosities of the wild-type cassava starches (327–527 RVU) were significantly higher than those of the waxy cassava starches (200–243 RVU). The breakdown viscosities of the wild-type cassava starches (191–379 RVU) were higher than those of the waxy cassava starches (102–140 RVU). After cooling, the final viscosities of the wild-type cassava starches were significantly higher than those of the waxy cassava starches (179–256 and 133–151 RVU, respectively). The setback viscosities of the wild-type cassava starches (56–99 RVU) were slightly higher than those of the waxy cassava starches (36–55 RVU).

Table 2 Pasting properties of waxy cassava starch compared to wild-type cassava starches
Fig. 3

a Starch paste viscosity profiles of wild-type and waxy starches from the 276 cassava genotypes, dispersed in distilled water to prepare a 9.21% starch suspension. b Box plots of the estimated genotypic effects on the phenotypes for loci that the genome-wide association study identified as affecting starch pasting properties in the cassava panel. For each significantly associated SNP, the plots show the effect of each genotype on the trait average

Population structure and linkage disequilibrium

Sequence reads were aligned to the cassava V6 reference genome (Bredeson et al. 2016), and the SNPs were then called, imputed in regions of low coverage, and filtered to give a total of 327,919 SNPs detected across the population (an average of 1.62 SNP per 1 Kbp) (Fig. 4a).

Fig. 4

a SNP density plot for the 18 chromosomes of cassava using a 1 Mb window size. b Scatter plot of PC1 versus PC2, with points colored by starch type (waxy vs. wild-type)

The 327,919 SNPs were filtered to retain those with a minor allele frequency > 0.01 and missing data per site < 90%, resulting in 88,934 SNPs. Population structure analysis of the 276 accessions genotyped across the 88,934 SNPs using PC analysis detected subtle genetic differentiation in the germplasm panel, with the first 10 PCs explaining about 25.25% of the genetic variation. The first two PCs accounted for 8.38% and 6.25% of the genetic variation, respectively (Fig. 4b). LD decay was not uniform across chromosomes (Fig. 5). The shortest per-chromosome LD decay distance (the distance at which r² dropped below 0.2) was observed on chromosome 11 (24 Kbp) and the longest on chromosome 5 (180 Kbp).
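The site-level filter described above can be illustrated with a short Python sketch. It assumes diploid genotypes coded as allele dosages (0, 1, or 2 copies of the alternate allele) with `None` for missing calls; this coding, the function names, and the example genotypes are hypothetical, and the study applied its filters within its own SNP-calling workflow rather than with this code. The thresholds mirror those stated in the text.

```python
def minor_allele_frequency(dosages):
    """MAF from 0/1/2 dosages, ignoring missing calls (None)."""
    called = [d for d in dosages if d is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))  # two alleles per diploid sample
    return min(alt_freq, 1.0 - alt_freq)


def keep_site(dosages, min_maf=0.01, max_missing=0.90):
    """Keep a SNP if MAF > min_maf and its missing-call rate < max_missing."""
    missing_rate = sum(d is None for d in dosages) / len(dosages)
    maf = minor_allele_frequency(dosages)
    return maf is not None and maf > min_maf and missing_rate < max_missing


# Hypothetical genotype vectors for three SNPs across eight samples.
snps = {
    "snp_polymorphic": [0, 1, 2, 1, 0, None, 1, 2],
    "snp_monomorphic": [0, 0, 0, 0, 0, 0, 0, 0],
    "snp_all_missing": [None] * 8,
}
print([name for name, calls in snps.items() if keep_site(calls)])  # ['snp_polymorphic']
```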

Fig. 5

LD decay by chromosome. LD plots for each chromosome and the mean LD for 18 chromosomes. Inset is the distance (bp) where LD drops below r2 of 0.2 for each chromosome

GWAS analysis

Yield-related traits

We analyzed genotype–phenotype associations for 12 traits (yield-related traits, starch pasting properties, and waxy versus wild-type starch) in 276 cassava genotypes with 89,943 SNPs in GAPIT, using CMLM, MLMM, and FarmCPU, and using BLINK for the waxy versus wild-type starch analysis. Three models (CMLM, MLMM, and FarmCPU) were compared in the marker–trait association tests. These three models identified different numbers of significant markers associated with FRW, SC, HI, and RP when the same significance threshold was used (Table 3). Based on the Q–Q plots for all models, the FarmCPU model was used for yield-related traits (Fig. 6 and Supplementary Figures S1, S2).

Table 3 Most significant SNPs associated with four yield-related traits and seven starch pasting property traits identified fitting the FarmCPU and BLINK in GAPIT and annotations of candidate genes harboring the significant SNPs using the phytozome 12 portal link to the biomart (https://phytozome.jgi.doe.gov/biomart/)
Fig. 6

Manhattan plots of SNP markers associated with FRW, SC, HI, and RP, and the corresponding quantile–quantile (Q–Q) plots, in cassava using the fixed and random model circulating probability unification (FarmCPU; yellow dot), the compressed mixed linear model (CMLM; navy blue), and the multi‐locus mixed model (MLMM; blue dot) (color figure online)

The starch pasting property traits were analyzed using only the CMLM model, and the dichotomous trait of waxy versus wild-type starch was analyzed using BLINK. A total of eight SNPs were significantly associated with yield-related traits (Fig. 6), with the phenotypic variance explained (R²) by a single QTL ranging from 2.54 to 25.32% (Table 3). For FRW, one SNP marker (S1_32060063) on chromosome 1 did not reach the significance threshold but showed a − log10 (p value) of 5 and explained 11.82% of the variance. Three SNPs were significantly associated with SC and were distributed on chromosomes 1 (S1_23945110), 6 (S1_172308681), and 18 (S1_502313244). The significant markers associated with the trait were mostly concentrated in a single region on chromosome 18. The most significant SNP marker (S1_502313244) had a − log10 (p value) of 7.66 and explained 10.61% of the phenotypic variance. Five SNPs were significantly associated with HI and were distributed on chromosomes 1, 4, 7, 14, and 17. The most significant SNP marker (S1_478075251) had a − log10 (p value) of 7.35 and explained 25.32% of the phenotypic variance. A marker on chromosome 4 did not reach the significance threshold for RP but had a − log10 (p value) of 4.96.

Starch type and starch pasting properties

A total of 46 QTLs were associated with starch type (wild-type versus waxy) (Table 3 and Fig. 7). Most of the QTLs associated with starch type were located on chromosome 2. The top significant SNP marker (S1_36202626) had a − log10 (p value) of 12.03 and explained 56.89% of the phenotypic variance between waxy and wild-type starch. The second most significant SNP marker (S1_35043061), on chromosome 2, distinguished three cassava genotypes, waxy (wxwx), heterozygous (Wxwx), and wild-type (WxWx), and explained 47.04% of the phenotypic variance (Fig. 3b).

Fig. 7

Manhattan plots of SNP markers associated with starch type (waxy and wild-type), analyzed with the BLINK method, and with peak RVU, trough RVU, and breakdown RVU, analyzed with the CMLM method, together with quantile–quantile (Q–Q) plots in cassava found with the fixed and random model circulating probability unification (FarmCPU; blue dot). The gray line represents the significance threshold from the Bonferroni correction method, − log10 (p value) = 6.25 (color figure online)

A total of 21 significant QTLs with − log10 (p value) > 6.25 were identified for six starch properties (PV, TV, BD, FV, SB, and PT), with the variance explained (R²) by each SNP ranging from 11.55 to 78.9% (Table 3). Two SNPs, on chromosomes 2 and 3, were associated with PV; SNP S1_35043061 on chromosome 2 had the lowest p value (− log10 (p value) = 6.89) and the highest R² (77.69%) (Figs. 7 and 8). Three SNPs on chromosomes 2 and 4 showed strong association signals for TV (− log10 (p value) > 6.25), with the lowest p value recorded for S1_35043061 on chromosome 2 (− log10 (p value) = 6.36); this marker explained 66.05% of the phenotypic variance. For BD, three SNPs were identified on chromosomes 2, 3, and 4, with the lowest p value found for SNP S1_76323649 on chromosome 3 (− log10 (p value) = 6.77); this marker explained 78.9% of the phenotypic variance. For FV, nine SNPs were identified; SNP S1_35043061 on chromosome 2 had the lowest p value (− log10 (p value) = 7.23) and explained 75.8% of the phenotypic variance. For SB, two SNPs were detected on chromosome 2, with the lowest p value found for SNP S1_35761681 (− log10 (p value) = 6.43); this marker explained 76.58% of the phenotypic variance. Two significant SNPs, on chromosomes 2 and 3, were associated with PeT, with the lowest p value recorded for SNP S1_35043061 on chromosome 2 (− log10 (p value) = 6.89); this marker explained 78.9% of the phenotypic variance. Two SNPs, S1_35043061 on chromosome 2 and S1_76323649 on chromosome 3, were associated with PV, TV, BD, FV, SB, and PeT. Genotypes carrying the favorable SNP alleles exhibited significantly different starch pasting property traits (Fig. 3b).

Fig. 8

Manhattan plot of SNP markers associated with final viscosity RVU, setback RVU, pasting temp RVU, peak temp RVU, and quantile–quantile (Q–Q) plots in cassava found with the CMLM method. A gray line represents the significance threshold. − Log10 (p values) = 6.25 represents the threshold from the Bonferroni correction method

Genomic prediction

Genomic prediction accuracies for the yield-related traits and starch properties were evaluated with all eight models. Table 4 summarizes the results for rrBLUP, BL, BRR, BayesA, BayesB, BayesC, RKHS, and RF. The prediction accuracies for RP using tenfold cross-validation were 0.64–0.71 with the BL model (Fig. 9). The prediction accuracy for HI was 0.59 with the rrBLUP model. All models had similar prediction accuracies for SC, of around 0.40 to 0.55. FRW had low prediction accuracies (− 0.001–0.09) with all models. The genomic prediction accuracies for starch pasting properties were 0.7–0.8 for peak RVU, trough RVU, breakdown RVU, final viscosity RVU, and setback RVU with all models (Fig. 10). Pasting temperature and peak temperature had prediction accuracies between 0.5 and 0.7 with all models. The RFC model had the lowest prediction accuracies of all the models for the starch property traits.

Table 4 Average predictabilities and their standard error (SE) for 11 traits using eight prediction models
Fig. 9

Predictive ability for yield-related traits based on the Pearson correlations between GEBVs and true breeding values from eight statistical models, using tenfold cross-validation (9:1 split between training and validation sets) with 30 replications. Different capital letters above the group labels indicate significant differences between groups

Fig. 10

Predictive ability of starch pasting property traits based on the Pearson correlations between GEBVs and true breeding values using eight statistical models and tenfold cross-validation with 9:1 for the training set and a validation set with 30 replications

Discussion

Yield-related traits, such as FRW, SC, RP, and HI, were highly variable among the cassava genotypes (Table 1). The broad-sense heritability estimates were 0.43 for FRW, 0.33 for SC, 0.40 for RP, and 0.57 for HI, and the narrow-sense heritability (h²) values ranged from 0.1 to 0.51 (Table 1). These observations are consistent with similar studies concerning yield-related traits in cassava (Wolfe et al. 2017; Andrade et al. 2019). In clonally propagated crops, nonadditive genetic effects can be effectively exploited by identifying superior genotypes as new varieties. For this reason, Wolfe et al. (2016) quantified the amount and nature of the nonadditive genetic variation in key traits in a GS breeding population of cassava. Previous findings in cassava, based on diallel crosses, showed that nonadditive genetic variation was significant, especially for yield traits (Cach et al. 2005; Calle et al. 2005; Jaramillo et al. 2005; Pérez et al. 2005; Tumuhimbise et al. 2014; Ceballos et al. 2015). Our results indicated nonadditive (mainly epistatic) variance for fresh root yield and mostly additive inheritance for root SC. These findings confirm the conclusions of numerous diallel studies conducted with cassava germplasm from Latin America (Cach et al. 2005; Calle et al. 2005; Jaramillo et al. 2005; Pérez et al. 2005) and Africa (Zacarias and Labuschagne 2010; Kulembeka et al. 2012; Tumuhimbise et al. 2014; Wolfe et al. 2016a). Previous studies of the general and specific combining abilities of cassava showed that fresh root yield, root number, harvest index, and plant height were predominantly controlled by nonadditive effects, whereas additive effects governed dry matter content and root diameter (Kawano 2003; Ceballos et al. 2016). Thus, SC or dry matter content is a trait that allows greater predictive ability, even in earlier breeding stages.
In contrast, parental selection for fresh root yield is more complex, requiring more refined strategies for analysis and selection (Kawano 2003; Ceballos et al. 2016). Molecular studies have also found strong nonadditive genetic effects for complex traits such as fresh root yield (Wolfe et al., 2016b). The heritability of each starch property-related trait was estimated from genetic data only. The SNP-based (narrow-sense) heritability of the starch property traits was high, at 0.58–0.85, indicating a significant contribution of additive genetic factors to the expression of these traits. The heritability estimates recorded in our study also indicated good repeatability and reproducibility of our experimental procedures.

The pasting temperatures of the wild-type and waxy cassava starches (68.45–75.25 ℃ and 67.75–73.36 ℃, respectively) correspond to pasting temperatures reported for cassava starches in other studies: 69–73 ℃ (Toae et al. 2019; Chaengsee et al. 2020) and 67.9–74.4 ℃ (Aldana and Quintero 2013). The peak viscosity of the wild-type cassava starches (327–527 RVU) was significantly higher than that of the waxy cassava starches (200–243 RVU). Peak viscosity values in this research differed from those in Sánchez et al. (2010) because of the different units used. Sánchez et al. (2010) reported peak viscosity in centipoise (cP), where 12 cP is equivalent to 1 rapid viscosity unit (RVU). When the data from Sánchez et al. (2010) were converted into RVU, the values were 73–81.58 RVU for normal cassava starch and 93.25 RVU for waxy starch in a 5% suspension. In this research, the values in a 5% suspension were 70–76 RVU for normal cassava starch and 85–107 RVU for waxy starch, which were relatively similar to those reported in Sánchez et al. (2010). At the same analysis concentration (9.21% w/w), Schirmer et al. (2013) reported that wild-type potato starch showed higher peak viscosity than waxy potato starch. However, at a different analysis concentration (5% w/w), Toae et al. (2019) reported that waxy cassava starches (116–131 RVU) had higher peak viscosity than wild-type cassava starches (80–94 RVU). Wang et al. (2017) explained the lower peak viscosity of waxy rice starch relative to wild-type starch in a 9.2% suspension as follows: when the suspension concentration exceeds the close packing concentration of starch, partially swollen granules of the waxy type are less rigid than those of the wild type, and this reduced rigidity during heating leads to lower peak viscosity. Evans and Lips (1992) reported that the close packing concentration for tapioca is 2.27%.
Therefore, a 9.2% suspension can be applied to distinguish between waxy and wild-type cassava starch. Even for a 5% starch suspension, Raemakers et al. (2005) reported that waxy cassava generated by antisense inhibition of GBSSI had lower peak viscosity than the wild type. For other pasting properties in our study, the breakdown viscosity of wild-type cassava starches (191–379 RVU) was considerably higher than that of waxy cassava starches (102–140 RVU), indicating their lower resistance to high temperatures and increased sensitivity to shearing stress. After cooling, the final viscosities of the wild-type cassava starches were significantly higher than those of the waxy cassava starches (179–256 and 133–151 RVU, respectively), indicating that the retrogradation tendency of the wild-type cassava starches was higher. The setback viscosities of the wild-type cassava starches (56–99 RVU) were slightly higher than those of the waxy cassava starches (36–55 RVU). The lower breakdown and setback viscosity values indicate greater starch paste stability and a lower retrogradation tendency of waxy cassava starches. The lower peak viscosity of waxy starch can be explained by the lower particle rigidity of partially swollen granules. The hypothesis that a pleiotropic effect of the waxy gene carries a yield penalty was examined by Karlström et al. (2016) in 87 waxy and 87 wild-type counterpart cassava lines from eight full-sib families; the waxy lines had 0.8% lower dry matter content than their wild-type counterparts. However, there is no evidence of a pleiotropic effect of the waxy gene on starch pasting properties. Higher fat and water-soluble carbohydrate contents in waxy compared with wild-type lines have been reported in wheat (Yasui and Ashida 2011) and rice (Pérez et al. 2019), but the starch contents in these two studies were reported to be inconsistent and unchanged relative to the wild type, respectively.
In this study, no significant SNPs for SC were detected in the spontaneous GBSSI mutants (Aiemnaka et al. 2012). However, preliminary results showed that waxy cassava had low SC or DMC compared with commercial varieties, possibly because of linkage between the waxy gene and a locus for SC or DMC. Further crosses among waxy cassava (wxwx) or improved heterozygous genotypes (Wxwx), followed by selection for waxy lines with higher SC or DMC, could therefore increase the chance of breaking the linkage between waxy starch and undesirably low SC.
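The unit comparison with Sánchez et al. (2010) rests on the stated factor of 12 cP per RVU. A minimal sketch of that conversion follows; the function names are hypothetical, and the example value is simply the 93.25 RVU figure quoted above, converted back to cP under that factor.

```python
CP_PER_RVU = 12.0  # conversion factor stated in the text: 12 cP = 1 RVU


def cp_to_rvu(viscosity_cp: float) -> float:
    """Convert a viscosity reading from centipoise (cP) to RVU."""
    return viscosity_cp / CP_PER_RVU


def rvu_to_cp(viscosity_rvu: float) -> float:
    """Convert a viscosity reading from RVU to centipoise (cP)."""
    return viscosity_rvu * CP_PER_RVU


# Round trip for the waxy starch value quoted in the text (93.25 RVU).
waxy_cp = rvu_to_cp(93.25)
print(waxy_cp, cp_to_rvu(waxy_cp))
```

Keeping the factor in one named constant makes it easy to check that every cross-study comparison uses the same unit.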

Linkage disequilibrium

LD in cross-pollinated species decays more rapidly than in self-pollinated species due to the more effective recombination that occurs in the former (Flint-Garcia et al. 2003). We observed rapid LD decay in the 276 genotypes, with r2 < 0.2 at approximately 243 kb. This distance was higher than in Brazilian germplasm (~107 kb; r2 < 0.1; Andrade et al. 2019), Brazilian cassava (~20 kb; r2 < 0.2; Albuquerque et al. 2018), East–West African datasets (~50 kb; r2 < 0.2; Wolfe et al. 2016), HapMap (3 kb; r2 = 0.1; Ramu et al. 2017), and elite IITA cassava breeding genotypes (10 kb; r2 = 0.212; Rabbi et al. 2020), and lower than in West African cassava (~2 Mb; r2 < 0.1; Rabbi et al. 2017). Given the allogamous nature of cassava, the LD pattern may reduce prediction accuracy since LD is crucial for GS breeding (Jannink 2010; Yabe et al. 2018); however, the LD observed in this study indicates the potential for improved accuracies when using this population for GS breeding.
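An LD decay distance such as "r2 < 0.2 at approximately 243 kb" can be read off pairwise r2 values in several ways. The sketch below uses a simple binned-mean approach, which is an assumption and not necessarily the procedure used in this study; the function name, bin size, and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple


def ld_decay_distance(
    pairs: Iterable[Tuple[int, float]],
    threshold: float = 0.2,
    bin_size: int = 10_000,
) -> Optional[int]:
    """Return the start (bp) of the first distance bin whose mean r2 is below threshold.

    pairs holds (distance_bp, r2) for SNP pairs on the same chromosome.
    Returns None if no bin falls below the threshold.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance_bp, r2 in pairs:
        bin_index = distance_bp // bin_size
        sums[bin_index] += r2
        counts[bin_index] += 1
    for bin_index in sorted(sums):
        if sums[bin_index] / counts[bin_index] < threshold:
            return bin_index * bin_size
    return None


# Toy data (hypothetical): r2 declines as the distance between SNPs grows.
toy_pairs = [
    (1_000, 0.8), (5_000, 0.6),
    (12_000, 0.4), (18_000, 0.3),
    (25_000, 0.15), (29_000, 0.1),
    (40_000, 0.05),
]
print(ld_decay_distance(toy_pairs))
```

Because the threshold (0.1 versus 0.2) and the estimation method differ among the studies cited above, decay distances are only roughly comparable across datasets.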

GWAS and candidate genes

We scanned for associations between SNPs and traits using CMLM, BLINK, and FarmCPU to identify genomic regions associated with yield-related traits, starch type (waxy versus wild-type starch), and starch pasting property traits in cassava. The MLM model controls P value inflation well but leads to false negatives, which weakens the identification of true associations (Zhang et al. 2010). To deal with this problem, we used the compressed MLM model (CMLM), which clusters individuals into groups and fits the genetic values of the groups as random effects in the model (Zhang et al. 2010). The CMLM method improves statistical power compared to regular MLM methods (Zhang et al. 2010). FarmCPU is a multi-locus model developed to control false positives without introducing false negatives (Liu et al. 2016). The FarmCPU model builds on a modified MLM method, MLMM, which incorporates multiple markers simultaneously as covariates in a stepwise MLM to partially remove the confounding between testing markers and kinship. To eliminate this confounding, MLMM is divided into two parts, a fixed-effect model (FEM) and a random-effect model (REM), which are used iteratively. The FEM tests one marker at a time, with multiple associated markers as covariates to control false positives. The associated markers are estimated in the REM, where they define kinship, to avoid model overfitting in the FEM. We evaluated these three models for false positives and false negatives based on the Q–Q plots. A sharp deviation from the expected P value distribution in the tail area would indicate that a model appropriately controlled both false positives and false negatives. Q–Q plots of complex models including CMLM, MLMM, and FarmCPU had a straight line with a slightly deviated tail, indicating that these models reduced false positives. We found that the FarmCPU model controlled both false positives and false negatives more effectively than the CMLM and MLMM models for yield-related traits.
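The Q–Q comparison described above plots observed against expected −log10 P values under the null hypothesis. The following sketch shows one common way to compute those points, together with the genomic inflation factor as a numerical summary; the inflation factor is not reported in this study and is included here only as an assumed companion diagnostic. All function names are hypothetical.

```python
import math
from statistics import NormalDist, median


def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs, most significant first."""
    observed = sorted(p_values)
    n = len(observed)
    return [
        (-math.log10((rank - 0.5) / n), -math.log10(p))
        for rank, p in enumerate(observed, start=1)
    ]


def genomic_inflation(p_values):
    """Median observed 1-df chi-square statistic divided by its null median."""
    std_normal = NormalDist()
    # A two-sided P value p corresponds to a 1-df chi-square of z(p/2)^2.
    chi2_stats = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    null_median = std_normal.inv_cdf(0.75) ** 2  # about 0.455
    return median(chi2_stats) / null_median


# Evenly spaced P values mimic a well-calibrated null distribution,
# so the points lie on the diagonal and the inflation factor is near 1.
null_p = [(i - 0.5) / 1000 for i in range(1, 1001)]
print(round(genomic_inflation(null_p), 3))
```

Points that follow the diagonal for most of their range and rise above it only in the tail match the pattern the text describes as appropriate control of false positives and false negatives.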

Usually, many QTLs for a given trait are detected in different populations from different environments, and the effects of QTL-by-environment interaction are inconsistent. In previous studies, Andrade et al. (2019), Ozimati et al. (2019), and Yonis et al. (2020) detected root yield marker–trait associations. Because the Bonferroni correction may produce more false negatives (Kaler and Purcell 2019), one QTL on chromosome 4 associated with RP and one QTL on chromosome 1 associated with FRW are reported in this study for further marker validation, as their −log10 p values were quite close to the threshold, with the caution that these SNPs did not pass the Bonferroni threshold in the FarmCPU analysis. Wolfe et al. (2017) and Zhang et al. (2018) used a compressed MLM approach to significantly decrease the number of false positives for yield components (number of storage roots, storage root weight, and dry mass weight) and storage root quality (dry matter content and starch content), and found one SNP locus on chromosome 2 that was associated with fresh weight and dry mass weight. MLM-based GWAS for harvest index, the ratio of fresh root weight to total plant weight, uncovered two genomic regions that were significantly associated with the trait. The first peak was on chromosome 2, tagged by SNP S2_2809137 (p value = 3 × 10^−8). The second locus occurred on chromosome 12, with SNP S12_6055806 showing the strongest association with the trait (p value = 5.4 × 10^−24). MLM analysis revealed two major loci, of which only one was previously reported (Rabbi et al. 2017). The most significant locus occurred on chromosome 1 in the 24.64 Mb region, and the second locus was tagged on chromosome 12. This study used FarmCPU in GWAS for yield-related traits, instead of CMLM or MLMM (Fig. 6), which could help exclude false-positive associations.
The Q–Q plots also suggested that the false-positive associations in this study were well controlled for the GWAS of the four yield-related traits.
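For orientation, a Bonferroni threshold on the −log10 scale can be computed as below. This is a sketch under two assumptions: a family-wise error rate of 0.05, which is not stated in this section, and a marker count equal to the 89,934 SNPs reported for this panel. The function name is hypothetical.

```python
import math


def bonferroni_neg_log10(n_tests: int, alpha: float = 0.05) -> float:
    """Return the Bonferroni significance threshold as -log10(alpha / n_tests)."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return -math.log10(alpha / n_tests)


# Assumed inputs: alpha = 0.05 and 89,934 tested SNPs (about 6.25).
threshold = bonferroni_neg_log10(89_934)
print(round(threshold, 2))
```

Because the threshold grows with the number of markers tested, associations that fall just short of it may still merit validation, which is the reasoning applied to the RP and FRW loci above.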

GWAS has been applied to dissect the genetic architecture of several complex traits in cassava (Zhang et al. 2018; Rabbi et al. 2020). However, no previous studies have focused on GWAS for starch pasting properties in cassava. Here, we performed GWAS for seven pasting properties in a panel of 276 wild-type and waxy cassava genotypes with 89,934 SNPs and identified 21 significant QTLs using the CMLM model. The QTLs detected in this study were compared with those from a previous study that used an F1 mapping population (Thanyasiriwat et al. 2014). The major genes of interest, which encode the family of glycosyl or glucosyl transferases and hydrolases, corresponded to the QTLs identified in previous studies. The remaining QTLs could be novel (Table 3).

Notably, the detected loci may not be the actual causative loci because of false positives caused by LD or population structure. To understand the molecular basis of starch pasting properties, we further investigated the locations of the associated QTLs by examining candidate genes within 50 kb upstream and downstream of the identified QTLs (Table 3). According to functional annotations, the candidate genes involved in the yield-related traits (fresh root weight, starch content, and harvest index) belonged to the group of DNA-binding proteins, such as histone H2A and the WRKY DNA-binding domain, and to the group of protein- and ATP-binding genes, such as the ring finger domain and F-box domain. Specific protein functions, such as no apical meristem (NAM) protein, protein phosphatase 2A regulatory B subunit, LYCOPENE BETA/EPSILON CYCLASE, stress-responsive A/B barrel, plant self-incompatibility protein S1, ATP SYNTHASE, phosphoesterase, diacylglycerol kinase, and protein kinase, were linked with the yield-related traits, suggesting that complex plant interactions with environmental stress signals are involved in cassava yield. Some of the candidate genes or their homologs are known to be linked to starch properties. One example is a gene annotated with chemical reactions and pathways involving carbohydrates, that is, carbohydrate metabolic processes (Manes.04G057600.1). Manes.02G000500 encodes a suberin monomer biosynthesis enzyme; suberin is part of the cell wall polymer complex. Interestingly, Manes.02G000500 is located on chromosome 2 (position 83,240), near Manes.02G001000, which encodes GBSSI at position 117,671, and this QTL containing the suberin biosynthesis gene was related to several starch pasting property traits, including peak viscosity, trough viscosity, breakdown, final viscosity, and peak temperature.
The bHLH-MYC and R2R3-MYB transcription factors (Manes.02G002300), which are involved in epicuticular wax synthesis, were also linked with peak viscosity and final viscosity (Elango et al. 2020). Manes.02G011700 encodes pectinesterase, one of the most heat-resistant enzymes, which leads to gelatinization and precipitation of pectin in puree and juice with subsequent loss of the juice’s cloudy appearance (Salas-Tovar et al. 2017). Other genes related to fat synthesis were also linked with starch pasting properties, such as those for CDP-diacylglycerol biosynthesis (Manes.03G071000) and beta-ketoacyl synthase (Manes.02G007700). The gene encoding DYNAMIN-RELATED PROTEIN, a GTP-binding protein (Manes.02G009200), was associated with final viscosity. Dynamin protein has been reported to be involved in the late stage of amyloplast division, which determines the size of starch granules and affects the starch gelatinization peak temperature (Yun and Kawagoe 2009).
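The candidate gene search within 50 kb upstream and downstream of each QTL amounts to an interval overlap query. A minimal sketch follows; the Gene record and the SNP positions are hypothetical, and the two genes are treated as single-position features because only one coordinate per gene is given in the text.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    gene_id: str
    chrom: str
    start: int
    end: int


def genes_near(snp_chrom: str, snp_pos: int, genes: List[Gene],
               window: int = 50_000) -> List[str]:
    """Return IDs of genes overlapping [snp_pos - window, snp_pos + window]."""
    lower, upper = snp_pos - window, snp_pos + window
    return [
        g.gene_id
        for g in genes
        if g.chrom == snp_chrom and g.start <= upper and g.end >= lower
    ]


# Positions quoted in the text; real gene models span start..end intervals.
annotation = [
    Gene("Manes.02G000500", "2", 83_240, 83_240),
    Gene("Manes.02G001000", "2", 117_671, 117_671),
]

# Hypothetical SNP positions on chromosome 2, for illustration only.
print(genes_near("2", 100_000, annotation))
print(genes_near("2", 140_000, annotation))
```

With a 50 kb window, a single significant SNP in this region can capture both the suberin biosynthesis gene and GBSSI, which is why proximity alone does not identify the causal gene.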

Genomic prediction accuracies

Pearson’s correlations (r) between the observed phenotypic values and the genomic estimated breeding values (GEBVs), based on a tenfold cross-validation scheme, were used to evaluate the efficiency of GS for yield-related traits and starch pasting property traits in wild-type and waxy cassava. Random sampling of training and validation sets was repeated 30 times, and the mean of the correlations was defined as the genomic predictive ability (PA; Spindel et al. 2015; Gouy et al. 2013; Heffner et al. 2010). Usually, a large sample of training and test data sets will be generated to estimate PA, which generally results in a normal or nearly normal PA distribution. The mean or median of the distribution represents the GS predictive ability of a trait (Rutkoski et al. 2012). In this study, the mean PA was used to represent the predictive ability under a certain training population size with a combination of cassava panels, GS models, and marker sets. The PA ranged from −0.001 to 0.71 for the yield-related traits, depending on the trait, and from 0.33 to 0.82 for the starch property traits, which was comparable to the values reported for GS in cassava breeding programs (Wolfe et al. 2016, 2017; Okeke et al. 2017; Ozimati et al. 2018; Kayondo et al. 2018; Andrade et al. 2019; Somo et al. 2020; Yonis et al. 2020).

Generally, PA is related to heritability. In this study, HI was moderately heritable (0.51; Table 1), with a PA of 0.50–0.59, whereas FRW had low heritability (0.10) and low PA (from −0.001 to 0.09). These results correspond to Wolfe et al. (2016b, 2017) and Andrade et al. (2019). Due to the higher heritability (0.58–0.85) of starch property traits, direct phenotypic selection could be sufficient in a breeding program. In contrast, for FRW, a higher genomic heritability is needed to predict yet-to-be-observed phenotypes. As the heritability of the trait decreases, the ratio of environmental (residual) variance to genetic variance increases (Momen et al. 2018). For genomic evaluation programs to succeed, markers should have an acceptable level of LD with QTL so that the markers can express the QTL effects efficiently in the population. The successful implementation of genomic prediction in the process of selection is determined by the ability of the model to predict or estimate the genetic potential of new breeding lines for a specific trait. Training populations have a major impact on model performance, especially their size, genetic composition, heritability, and relatedness to the selection population (Isidro et al. 2015). FRW is an important yield-related trait in cassava. Low heritability traits such as FRW are complex traits that are controlled by many small effects of gene interactions and genotype-by-environment interactions (Kawano 2003; Ceballos et al. 2016; Wolfe et al. 2016b, 2017). The additional costs of genotyping would be counterbalanced by a higher reliability for predicting FRW. However, high-density markers and analytical skills for genomic prediction of nonadditive gene effects and environmental interactions are required for low heritability traits.
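The relationship between heritability and the residual-to-genetic variance ratio can be made explicit. The worked example below assumes the simplest partition of phenotypic variance into a genetic component and a residual component, which is an assumption made here for illustration; the heritability values are those quoted in this section.

```latex
\[
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}
\quad\Longrightarrow\quad
\frac{\sigma_e^2}{\sigma_g^2} = \frac{1 - h^2}{h^2}
\]
\[
\begin{aligned}
\text{FRW } (h^2 = 0.10) &: \quad \frac{0.90}{0.10} = 9 \\
\text{HI } (h^2 = 0.51) &: \quad \frac{0.49}{0.51} \approx 0.96 \\
\text{starch property traits, upper end } (h^2 = 0.85) &: \quad \frac{0.15}{0.85} \approx 0.18
\end{aligned}
\]
```

Under this partition, the residual variance for FRW is about nine times the genetic variance, whereas for the most heritable starch property traits it is less than one fifth of it.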

GS models

In GS, the best prediction model is commonly identified by the cross-validation of phenotypic and genotypic data from a training set (Heffner et al. 2009). We tested eight models and found low PA values for FRW, in line with Wolfe et al. (2017) and Andrade et al. (2019), who reported low PA for fresh root yield (0.18–0.37 and 0.475, respectively). Overall, previous studies across crops and traits reported only slight differences in performance among the different genomic prediction algorithms (Asoro et al. 2011; Crossa et al. 2013; Heslot et al. 2012; Lorenz et al. 2012). We found that FRW, a more complex and lower H2 trait, generally had lower PA than a simpler and higher H2 trait, such as RP. This is consistent with other studies, where complex traits controlled by many loci with small effects had lower prediction accuracy than less complex traits (Hayes et al. 2010; Wolfe et al. 2016a, b). Genomic predictions should be more accurate for traits with higher H2 (Daetwyler et al. 2010; Lorenz 2013; Combs and Bernardo 2013). Among the parametric and nonparametric methods, RFC showed the lowest PA, although the difference was not statistically significant (P > 0.05) for FRW, HI, and RP. The variation in prediction superiority for RKHS has been observed in previous studies (Wolfe et al. 2017; Andrade et al. 2019). Empirical studies in cassava demonstrated lower predictive abilities for dry matter content of 0.24–0.68 (Wolfe et al. 2017) and 0.5655–0.5670 (Andrade et al. 2019). They did not find any great differences between the GS methods for SC and dry matter content. Although a different method (i.e., specific gravity) was used to evaluate the starch content, our predictive ability results were higher. In this study, the PA of SC was significantly different for the RFC and RKHS models.

For the starch pasting-related traits, there were no significant differences between the parametric and nonparametric models, except for RFC, which had the lowest performance. Differences in the predictive ability of the nonparametric models could be due to the intrinsic way in which the various prediction models incorporate marker information. While these models make no assumptions about gene action, nonlinearity is introduced using specific methods (Morota and Gianola 2014). In general, the performance of predictive models is known to depend on the genetic architecture of the trait under consideration (Daetwyler et al. 2010; Su et al. 2014). Although nonadditive models, including RFC and RKHS, capture dominance and epistatic effects, GBLUP is more suitable for prediction when traits are determined by a large number of unlinked and non-epistatic loci with small effects.

The cross-validation results were mostly consistent across the breeding programs, and the superiority of one prediction method over the others was trait-dependent. RF and RKHS usually predicted phenotypes more accurately for yield-related traits, which are known to have a significant amount of nonadditive genetic variation (Wolfe et al. 2016). Similar findings have been made in wheat (Triticum aestivum L.) for grain yield, an additive and epistatic trait, in which RKHS, radial basis function neural networks, and Bayesian regularized neural network models had a better predictive ability than additive models like BL, Bayesian ridge-regression, BayesA, and BayesB (Perez-Rodriguez et al. 2012).

Conclusions

This study aimed to identify major QTLs to facilitate MAS and to explore the utility of GS for yield-related and starch pasting property-related traits in cassava breeding programs. GWAS identified QTLs for yield-related traits (eight), starch type (waxy versus wild-type starch), and starch pasting property-related traits (21). We performed GWAS and evaluated GS for 12 yield and starch-related traits using data collected for 276 cassava lines derived from TTDI, CIAT, and IITA, and other advanced breeding lines. We provided a comparison of eight genomic prediction models for predictive ability. The results showed that the predictive abilities of the eight models were not significantly different for the yield-related traits, except that the RKHS models were better than the RFC models for SC, and that the RFC model had a significantly lower predictive ability for the starch pasting property-related traits. Given these promising results, the implementation of MAS and GS can enable breeders to better select those quality traits in a larger population at earlier stages, thus enhancing selection efficiency. Opportunity exists to explore the genetics underlying other traits and agronomic characteristics, which will be addressed in future investigations.