Introduction

As the world population size approaches the nine billion mark, cassava (Manihot esculenta Crantz) is projected to be a cornerstone on which to rest aspirations for meeting increased calorie demands in sub-Saharan Africa (SSA), which are currently experiencing the highest population growth rates in the world (Edgerton 2009; Burns et al. 2010; Tan 2015). The starchy root crop is grown and consumed widely in tropical regions of Africa, Asia and Latin America, where it dependably provides household food security in resource-poor farming communities (Monfreda et al. 2008; Salvador et al. 2014). Globally, at least 800 million people derive the bulk of their dietary energy from cassava on a daily basis and over 500 million of these people live in SSA (FAOSTAT 2009; Montagnac et al. 2009; Burns et al. 2010). In Uganda, cassava is the second most consumed crop after banana. However, more than 90 % of the edible portion of cassava roots consists of carbohydrates (Montagnac et al. 2009), making diets that heavily depend on cassava to be deficient in essential micronutrients such as vitamin A, iron and zinc. In particular, vitamin A deficiency (VAD) has remained a widespread nutritional challenge (WHO 2009; Manson et al. 2014).

Consequently, people that depend on cassava for food are vulnerable to VAD, which is associated with corneal blindness, reduced immunity to preventable diseases, stunted growth, anaemia and reduced capacity for physical activity (Gegios et al. 2010; Stephenson et al. 2010). It was in response to this distinct challenge that a special breeding agenda, referred to as biofortification, was initiated in the early 2000s under the HarvestPlus project to develop cassava varieties enriched with provitamin A carotenoids (Pfeiffer and McClafferty 2007). In practice, consumption of provitamin A cassava can be considered a home-based therapy, as it is expected to minimise VAD that is widespread among resource-poor farmers that primarily depend on cassava (Mayer et al. 2008; Talsma et al. 2013).

However, the intrinsic heterozygous nature, long breeding cycles and high sensitivity to environmental variations make cassava a difficult crop to breed conventionally (Ceballos et al. 2004, 2015). Nonetheless, recent advances in molecular breeding technologies have proved to be useful for increasing genetic gains for complex traits in crops such as wheat (Poland et al. 2012), maize (Ersoz et al. 2009; Crossa et al. 2013) and rice (Yuan-yuan et al. 2014) and a replica of such technologies is expected to enhance genetic improvement of cassava (Oliveira et al. 2012). Indeed, a classic example for this was the pioneer marker-assisted introgression of cassava mosaic disease resistance into Latin American cassava germplasm prior to its introduction into Africa (Okogbenin et al. 2007). This is a classical form of marker-assisted selection (MAS), in which either single genes or quantitative trait loci (QTL) were successfully introgressed. In the case of cassava carotenoids, patterns of phenotypic segregation appear to suggest that the trait is controlled by more than one gene (Akinwale et al. 2010). Efforts that can pinpoint the allelic basis of such variability would, for example, enhance chances of using gene-based selection to facilitate the rapid recurrent selection scheme that has already been demonstrated for carotenoid content in cassava (Ceballos et al. 2013).

Recent advances in next generation sequencing (NGS) technologies have made linkage disequilibrium (LD)-based genome-wide association study (GWAS) a favourable tool for mapping complex traits in higher plants (Davey et al. 2011). It is now practical to use high-density genotyping platforms such as genotyping-by-sequencing (GBS) to quickly generate millions of marker data points that are distributed throughout a genome for purposes of facilitating genetic improvement in plants (He et al. 2014). For example, Owens et al. (2014) and Suwarno et al. (2015) have used these NGS platforms to identify polymorphisms linked to increased carotenoid contents in maize, which certainly provides a foundation for systematic provitamin A biofortification. With more than 97 % of the cassava genome sequenced (Prochnik et al. 2012), it is now possible to use GBS (Elshire et al. 2011) and GWAS (Oliveira et al. 2012) to precisely anchor single nucleotide polymorphisms (SNPs) linked to traits of interest. Indeed, these platforms make it easy to infer functions of such polymorphisms based on annotations of genomic regions where they are positioned. Rabbi et al. (2014) used SNP data generated from the GBS platform to develop a high-resolution genetic map that identified genomic regions of cassava controlling resistance to cassava mosaic geminiviruses.

It is against this background that the national cassava breeding programme in Uganda recently initiated a breeding objective tailored towards development of provitamin A carotenoid content in cassava by harnessing advancements in genomics, notably the GBS technology, and diverse carotene-rich genetic resources. Specifically, the current study was conducted to identify genomic regions and SNPs linked to natural variations for carotenoid content in cassava as a basis for more systematic carotenoid biofortification in the crop.

Materials and methods

Genotypes

A panel of 655 genotypes was considered for this study. These genotypes included an array of S1 and S2 partial inbreds generated from eight S0 progenitors of diverse genetic background (Table 1). S0 progenitors were carotene-rich lines selected from a set of clones previously introduced from the International Center for Tropical Agriculture (CIAT) and International Institute for Tropical Agriculture (IITA) for breeding in Uganda (Esuma et al. 2012). S0 progenitors were self-pollinated to generate S1 progeny during the 2009–2010 cropping season. Thirty-nine S1 clones with the ability to flower were planted in a crossing block in August 2011 and self-pollinated to generate S2 genotypes. In all cases, self-pollinations were performed following the standard procedure described by Kawano (1980).

Table 1 Pedigree and number of cassava genotypes used for the genome-wide association study

Phenotyping

Clonal trials for the 665 association panel were conducted between June 2014 and May 2015 at two locations in Uganda: Abi Zonal Agricultural Research and Development Institute (Abi-ZARDI) and National Crops Resources Research Institute (NaCRRI), which have different agro-ecological characteristics. NaCRRI is located within the Lake Victoria crescent at 32°37′36.0″E and 0°31′13.7″N, 1134 m above sea level, while Abi-ZARDI is located within north-western savannah grasslands at 31°1′28.4″E and 2°36′33.3″N, 1060 m above sea level. At each site, trials were laid out in an incomplete block design (Clewer and Scarisbrick 2001) with two replications. Plots in a block consisted of single rows of three plants each. Planting was done at a spacing of 1 × 1 m. Trials were entirely rain fed and all agronomic practices including weed management were strictly followed. Total carotenoid content (TCC) of roots was measured at 12 months after planting using the iCheck analytical kit developed by BioAnalyt Laboratory, Germany (http://www.bioanalyt.com).

Briefly, three roots (with the ability to provide ≥250 g parenchyma tissue) were processed for measurement of TCC, each root being a random selection from each of the three plants in a plot. The selected roots were peeled, washed under running water, dried with a paper towel and cut longitudinally into quarters so that the opposite quarters were pooled, chopped into small pieces and homogenised. About 5 g of the homogenous root sample was pounded and ground into a smooth and fine paste using a mortar and pestle. To aid grinding of the sample, 20 ml of distilled water was added gradually and the resulting solution transferred into a 50 ml calibrated tube. The tube content was shaken thoroughly and 0.4 ml of the solution injected into the iEx™ CAROTENE vial using the syringe and needle provided with the kit. Vials were placed on a solid surface for approximately 5 min, shaken again and allowed to stand until two solution phases appeared inside the vial: a clear upper phase and a turbid lower phase. At this point, the absorbance of the vial content (the upper solution phase) was measured using the iCheck™ CAROTENE device. TCC was calculated as:

$${\text{TCC }}\left( { \mu g^{ - 1} } \right) { = }\frac{{V_{s} }}{{W_{s} }} \times {\text{A}}$$

where Vs = volume of solution transferred to the tube, Ws = weight of sample and A = absorbance of the iEx™ CAROTENE vial content at a wavelength of 450 nm. Each sample was extracted and measured for TCC once. All procedures for carotenoid quantification were performed in a dark room and harvested roots were analysed for TCC within 12 h. Root flesh colour (RFC) was also scored for all genotypes using visual inspection following the standard colour scale as described by (Sánchez et al. 2006).

Genotyping

Total genomic DNA was extracted from newly expanded plant leaves according to the procedure described by Dellaporta et al. (1983). The GBS platform described by Elshire et al. (2011) was used to genotype the association panel. Briefly, restriction enzyme ApeKI, which is a type II restriction endonuclease that recognises a degenerate 5 bp sequence (GCWGC, where W is A or T), was used to digest DNA samples. ApeKI was selected as the restriction enzyme because it has partial sensitivity to DNA methylation and cuts genomic DNA at low frequency, thus avoiding repetitive genomic regions and enhancing adapter ligation to insert DNA (Peterson et al. 2014). GBS sequencing libraries (95-plex) were prepared by ligating the digested DNA to barcodes (unique nucleotide adapters) followed by standard polymerase chain reaction. Libraries were sequenced using 1-lane 86 bp reads using the Illumina HiSeq 2000. Sequence reads from different genotypes were de-convoluted using barcodes and aligned to the cassava reference genome sequence v6.1 (Glaubitz et al. 2014). The TASSEL-GBS pipeline was used to make SNP calls following the bioinformatics procedures described in Elshire et al. (2011) and Glaubitz et al. (2014).

Genotypic data were further processed to ensure high quality for subsequent analyses. The dataset originally had close to 70 % missing data points, which necessitated imputation of the non-genotyped markers, as described in earlier studies (Romay et al. 2013; Swarts et al. 2014). To achieve this, indels were first removed from the raw sequence data using TASSEL v5.2.9 (Bradbury et al. 2007) and the cleaned dataset imputed using Beagle software v4.0 based on an algorithm which searches for the closest neighbour in small SNP windows across the entire genome (Browning and Browning 2013; Swarts et al. 2014). To remove non-segregating and uninformative sites, the imputed data were further filtered at the following thresholds: minor allele frequency (MAF) = 0.01, minimum count of genotypes = 50 and minimum heterozygous proportion = 0.01. This filtered dataset represented a reduced marker set comprising 179,310 SNPs with a robust MAF desirable for estimating population structure and kinship, both of which were important for subsequent statistical analyses.

Statistical analysis

Out of the 655 individuals initially considered for this study, 591 genotypes had both phenotypic and genotypic data and were used for subsequent statistical analyses. A mixed linear model (MLM) fitted across environments was used to conduct analysis of variance (ANOVA) using the lme4 package of the R software (R Development Core Team 2010; Vazquez et al. 2010). The model used was:

$$Y_{\text{ijk}} = \mu + {\ss}_{i} + R_{\text{ij}} + G_{k } + ( {\text{ss}}_{i} {\text{x G}}_{k} )+e_{\text{ijkm}}$$

where Yijk = phenotypic value, μ = overall phenotypic mean, β i  = effect of environment i, R ij  = effect of block j in environment i, G k  = effect of genotype k, (β i , × G k ) = effect of interaction between environment i and genotype k and e ijkm  = residual. Both the phenotypic means and best linear unbiased predictions (BLUPs) extracted from the MLM were used for genome-wide association (GWA) testing to benefit from the additional power that this might offer.

Unless stated, all statistical analyses of the genotypic data were performed using TASSEL v5.2.9 (Bradbury et al. 2007). Principal component analysis (PCA) through eigenvalue decomposition of the covariance matrix was performed using the filtered SNP set to infer population structure. The first two axes of the PCA were used to draw a scatter plot to visualise genetic differentiation among genotypes. In execution of GWA analysis, the first three principal components constituted the population structure. A kinship matrix was constructed using the scaled_IBS method (IBS = identical by state). This method produces a matrix that is scaled to give a reasonable estimate of the additive genetic variance, which effectively controls statistical errors arising from population structure in a MLM for the analysis. The kinship matrix was calculated according to Bishop and Williamson (1990):

$${\text{IBS distance}} = \frac{{\left[ {{\text{No}} . \, {\text{of markers IBS}}2 + 0.5 \, \left( {{\text{No.}}\, {\text{of markers IBS1}}} \right)} \right]}}{\text{Number of non-missing markers}}$$

where IBS1 = state in which two genotypes share one allele at a locus and IBS2 = state in which two genotypes share two alleles at a locus.

Genome-wide and chromosome-wise LD were assessed based on adjacent pairwise r 2 values (the squared correlation coefficients among alleles at two adjacent SNP markers) and physical distances among these SNPs (Remington et al. 2001). LD was estimated using the unimputed dataset consisting of 194,322 SNPs, considering only markers with <30 % missing data and MAF >0.05. Average pairwise distances in which LD decayed at r 2 = 0.2 and r 2 = 0.1 were then calculated based on the model, whereby the expected value (E) of r 2 was:

$$E\left( {r^{2} } \right) { = }\left[ {\frac{{ 1 0 {\text{ + C}}}}{{ ( 2 {\text{ + C)(11 + C)}}}}} \right]\left[ { 1 { + }\frac{{ ( 3 {\text{ + C)(12 + 12C + }}C^{2} )}}{\text{n(2 + C)(11 + C)}}} \right]$$

where r 2 = squared correlation coefficient, n = sample size and C = model coefficient for the distance variable (Hill and Weir 1988).

After removing monomorphic and low-quality SNPs, a total of 179,310 SNPs were available for the 591 member association panel. The MLM was used for testing individual SNP-based associations, following the correlation/trend method described by Weir (2008). The method was chosen for statistical analyses due to its ability to achieve consistently good control over false positives while yielding the highest power among other structure correction methods (Wang et al. 2012). The MLM is illustrated as:

$${\text{Y }}= ({\text{SNP}} x {\ss}) + ({\text{PC}} x \alpha ) + {\text{K}} + \varepsilon$$

where Y = phenotypic response of the dependent variable (means or BLUPs of TCC), SNP = SNP marker (fixed effect), β = fixed effect model coefficient for SNP, PC = first three components of PCA (which accounted for 97.6 % variation) to present population structure (fixed effect), α = fixed effect model coefficient for PC (population structure), K = kinship matrix (random effect) and ε =  residual. Population structure was incorporated into this model to correct for their confounding effects on marker-trait associations, thus controlling false positives (Li et al. 2014).

Evaluations of the association mapping model were based on the quantile–quantile (Q–Q) plot, which is a plot of observed −log10 P values versus expected −log10 P values under the null hypothesis that there is no association between a SNP and the phenotype. The approach used for identifying GWA signals was based entirely on the smallest P values obtained from the MLM, where SNPs were ranked based on ascending order of their P values. SNPs with P values less than the 5 % Bonferroni threshold were considered to be significantly associated with phenotypes. Furthermore, chromosome-wise association signals were visualised from Manhattan plots generated using the qqman package of R software (Turner 2014). We browsed the cassava reference genome sequence v6.1 using phytozome v11.0 (Goodstein et al. 2012) to search for model genes in the proximities of significant SNPs. To overcome issues of non-collinearity that may exist within this genome version (given it is only partially complete), broader intervals (up to 0.5 Mb) were considered while searching for pathway genes in the vicinity of SNPs showing significant associations.

Results

Phenotypic variability and correlations

For the 591 genotypes considered for statistical analyses, average TCC was 4.9 µg g−1, varying from 0 to 13.45 µg g−1 on a fresh weight basis, which presented considerable variability for carotenoid content. RFC varied from 1 (white) to 6 (deep yellow). Distribution of the genotypes based on RFC revealed that 135 clones were white-fleshed, 37 light cream, 99 cream, 134 light yellow, 129 yellow and 57 deep yellow. ANOVA showed significant genotypic effects for TCC, but environmental effects were non-significant (Supplementary Table 1). Broad-sense heritability was high for TCC (0.91). Phenotypic values and BLUPs for TCC showed a bimodal distribution (Fig. 1). The Pearson correlation coefficient between TCC and RFC was high, with r = 0.84 (P ≤ 0.01).

Fig. 1
figure 1

Scatter plot and histograms of total carotenoid content (TCC) versus best linear unbiased predictions (BLUPs) of TCC. Colour 1-WT white, 2-LC light cream, 3-CM cream, 4-LY light yellow, 5-YL yellow, 6-DY deep yellow, red lines on histograms indicate means

Marker coverage and missing data

A total of 635 247 SNP markers were used to genotype the panel of 655 clones. Overall, there was a low SNP call rate, with an average proportion of missing data of 71 % (Supplementary Table 2). The average SNP call rate per sample was 28.2 %, varying from 1.3 to 81.4 %. Following removal of monomorphic markers and imputation of the genotype subset, 179 310 SNPs were selected and used for subsequent analyses. These SNPs were distributed along the 18 chromosomes and two linkage groups of unanchored scaffolds of the cassava genome sequence v6.1. When further filtering of the GBS data was done to remove genotypes missing phenotypes, a total of 591 genotypes were left for the association testing. A comparison of the imputed GBS data with results from the unprocessed GBS SNP data for the 591 member association panel showed a median discrepancy rate of 8.2 % for all calls. When heterozygous calls were excluded, the median error rate was reduced to 4.3 %, which highlighted high accuracy of imputation. The imputed data were therefore used to perform GWA analysis.

Population structure, allele frequency and linkage disequilibrium

Cassava genotypes used in this study were drawn from breeding populations developed using eight parental lines sourced from either IITA or CIAT; thus they presented a set of genotypes from diverse genetic backgrounds, which would be a possible source of population structure. PCA was used to describe the population structure for the association panel for this study. Grouping of genotypes along the axis of the first two principal components scattered genotypes around the centre, without clear-cut differentiation between them. However, classifying genotypes according to their origin revealed some level of genetic differentiation, with the IITA group drifting to the left along the first principal component axis (Fig. 2). When the PCA was examined using the GAPIT programme (Lipka et al. 2012) for analysis of population structure, the clustering pattern suggested some structure within the population studied (Supplementary Fig. 1). This apparent genetic differentiation provided the basis for using both PCA and the kinship matrix in the MLM for association analysis, which was an important strategy to minimise any false positives in the mapping result. Overall, 82 % of the SNPs typed in the entire population had MAF <0.01 (Fig. 3), with more than 71 % of the alleles shared between the CIAT and IITA accessions. The average physical distance between pairs of markers was 23.4 kb and average genome-wide LD (r 2) obtained based on adjacent pairs of markers was 0.29. Genome-wide LD decay was 1.32 kb at r 2 = 0.2 and 3.21 kb at r 2 = 0.1 (Supplementary Fig. 2).

Fig. 2
figure 2

Scatter plot of 591 genotypes based on principal component analysis. PC1 principal component 1, PC2 principal component 2, CIAT International Centre for Tropical Agriculture, IITA International Institute for Tropical Agriculture

Fig. 3
figure 3

Distribution of minor allele frequency (MAF) based on unfiltered genotyping by sequencing data

Association results

Based on the MLM analysis for TCC BLUPs, a strong GWA signal was detected (Fig. 4) and a chromosome-wise analysis mapped the signal to chromosome 1 of the cassava genome (Fig. 5). Within this peak, four SNPs had P values smaller than the 5 % Bonferroni threshold for genome-wide significance. This association signal was consistent for analyses based on TCC means (Supplementary Fig. 3). When the association tests were performed using RFC as a phenotype, an association hit similar to that obtained for BLUPs and TCC means was detected (Supplementary Fig. 4). This particular result corroborated the strong positive association between RFC and carotenoid content. Manhattan plots consistently reflected these results. In all these cases, the same set of SNPs was identified to show significant association signals. The four SNPs with P values below the 5 % Bonferroni correction threshold were located within a stretch of 1.37 Mb on chromosome 1 (Fig. 6). The MLM analyses based on both BLUPs and TCC means indicated highest −log10(P) for marker S1_24661940 (P = 4.4E−08) followed by marker S1_24654776 (P = 1.3E−08). These two SNPs accounted for the highest phenotypic variance (9 %) for TCC and were separated by a distance of 0.01 Mb (Table 2). Meanwhile, marker S-1_23297833 had the least −log10(P) and a low phenotypic variance (7 %) for TCC.

Fig. 4
figure 4

Quantile-quantile plots for diagnosis of association signals based on best linear unbiased predictions for total carotenoid content

Fig. 5
figure 5

Manhattan plots for genome-wide diagnosis of association signals based on best linear unbiased predictions for total carotenoid content. P and Q are linkage groups containing unanchored scaffolds in the cassava genome sequence v6.1. Horizontal blue and red lines are the suggestive and 5 % Bonferroni threshold lines, respectively

Fig. 6
figure 6

Manhattan plot for genome-wide diagnosis of association for chromosome 1, zooming out the region with association signal for total carotenoid content. Horizontal blue and red lines are the suggestive and 5 % Bonferroni threshold lines, respectively

Table 2 List of SNPs with genome-wide association significance for total carotenoid content

Reference was made to the cassava genome sequence v6.1 in Phytozome v10.3 (Goodstein et al. 2012) to identify annotated genes within a distance of ± 0.5 Mb of the genomic region occupied by the significant SNPs. Manes.01G124200.1, which is a phytoene synthase (PSY) gene known to increase accumulation of provitamin A carotenoid in cassava roots, was found within the genomic region occupied by significant SNPs. The genomic size of this gene is 3.3 kb stretching from 24,153,420 to 24,156,720 bp positions and falls within the 1.37 Mb portion occupied by association hit for TCC on chromosome 1. Functional annotation of this gene relates its biological activity to the enzyme PSY, which is strongly linked to carotenoid biosynthesis in plants (Goodstein et al. 2012). SNPs S1_23297833 and S1_24601429 were the significant markers that flanked Manes.01G124200.1 upstream and downstream, respectively. However, SNPs with the highest association significance for TCC were furthest downstream of the phytoene synthase locus.

Further reference was made to version 2.0 of the cassava haplotype map being constructed in the Buckler Laboratory at Cornell University, USA (unpublished data, Ramu Punna), for functional annotation of the SNPs significantly associated with TCC. Sequences of cassava genotypes rich in carotenoid content (yellow-fleshed genotypes with TCC >5 g µg−1) were compared to those of genotypes with negligible carotenoid content (white-fleshed genotypes with TCC <1 g µg−1), following an alignment in the Mega software. Amino acid lysine in a conserved region of S1_24601431 of white-fleshed genotypes was substituted by threonine in yellow-fleshed genotypes. SNP S1_24601431 in the haplotype map co-segregated with TCC in a similar pattern to that of S1_24601429 from the GBS data. Amino acid substitutions at the other three significant SNPs were non-informative with regard to carotenoid biosynthesis.

Discussion

Agricultural production across the world is being challenged by a rapid human population growth, limited arable land and adverse climate changes, which call for greater efforts to optimise the use of available resources needed to balance increased food production with reduced environmental effects. On their part, plant breeders could benefit from application of advanced genomic tools to discover and use functional genetic variations for crop improvement (Moose and Mumm 2008). Fortunately, diverse cassava genetic resources exist across the tropics for economically important traits (Nassar 2003; Nassar et al. 2007). Such useful genetic variation could be explored through GWAS to isolate and improve agriculturally important traits in a crop that will certainly provide food to this rapidly increasing human population (Ceballos et al. 2015). Accordingly, this study focused on carotenoid content, a trait of immense value for diets in SSA. Thus, a GWA testing with 179,310 SNP markers and 591 cassava partial inbreds was conducted with the view of understanding genetic causes of natural variation for carotenoid content in cassava. Such knowledge would provide a framework for undertaking more systematic genetic improvement of cassava for increasing levels of carotenoid content in roots.

It suffices to note that one serious consideration in defining an association panel is the extent of LD, which can greatly influence GWAS results because larger LD blocks and a slower rate of LD decay generally result in lower mapping resolutions (Grady et al. 2011). For example, Romay et al. (2013) observed more rapid LD decay in tropical germplasm of maize compared to that in temperate germplasm, which means a higher resolution map would be achieved by using the tropical germplasm. In the current study, two broad accession groups (originating from CIAT and IITA) showed comparable LD decay, implying that similar mapping results could be achieved with one or a combination of these sets of genotypes.

Four SNPs (S1_23297833, S1_24601429, S1_24654776 and S1_24661940) with GWA significance for TCC were identified to be located within a single major QTL. These polymorphisms were located in the vicinity of a previously characterised PSY gene with functions linked to carotenoid biosynthesis in cassava. PSY2-Y-2 was reported to increase accumulation of provitamin A carotenoids (Welsch et al. 2010) and the same gene family is known to regulate carotenoid accumulation in rice (Beyer et al. 2002), maize (Fu et al. 2010) and durum wheat (Giuliano 2014). According to Welsch et al. (2010), a SNP in PSY2-Y-2 gene co-segregated with high carotenoid content in cassava roots. This polymorphism resulted in a single amino acid change in a highly conserved region of the protein which, when tested in Escherichia coli, resulted in increased catalytic activity. Findings of the current study strongly corroborate the earlier reports on carotenoid genes in cassava by Welsch et al. (2010). It is also very likely that additional genetic variants that explain natural variation for carotenoid content in the crop exist around this genomic region. Based on the population used in this study, carotenoid accumulation in cassava roots appears to be under the control of a single gene with several allelic variants, which share strong positional relationship with a previously characterised PSY gene. Essentially, the current study strongly supports previous reports that suggested carotenoid content in cassava roots as a qualitative trait.

The GWA hits for TCC concentrated around the region occupied by Manes.01G124200.1, but the most significant SNPs within the hit were located furthest from this carotenoid gene. This apparent clustering of the significant SNPs could be a manifestation of more than one QTL for TCC within the association region, suggesting the possibility of discovering additional genes controlling accumulation of carotenoid content in cassava. Welsch et al. (2010) identified provitamin A carotenoid polymorphisms using a bi-parental mapping population of 164 individuals developed from two landrace genotypes from Brazil. An important feature of bi-parental mapping populations is that alleles segregating in progeny are limited to allelic diversity in the two parents (Semagn et al. 2010). In the current study, a panel of diverse germplasm of both African and Latin American origins was used to survey polymorphisms linked to TCC across the cassava genome. This approach may have presented an opportunity for discovering additional polymorphisms that could furthermore explain the extent of phenotypic variation for carotenoid content. It is possible that the diverse genetic material assayed in this study increased the frequency and chances of detecting novel genetic variants explaining the global phenotypic diversity for TCC (Korte and Farlow 2013).

Indeed, substitution of lysine in the white-fleshed genotypes by threonine in the yellow-fleshed genotypes at S1_24601429 strongly suggests the metabolic role of this particular SNP in carotenoid biosynthesis. Threonine has been reported as a substance encoded by PSY1 in all maize varieties that produce yellow endosperm (rich in provitamin A carotenoids) (Shumskaya et al. 2012; Shumskaya and Wurtzel 2013). A variant of threonine expressed in golden rice was also attributed to enzyme PSY in carotenoid biosynthesis (Paine et al. 2005). Thus, it can be postulated that SNP S1_24601429 is a component of a PSY gene previously reported to increase accumulation of carotenoid content in cassava roots (Welsch et al. 2010). It can further be suggested that more than one allelic variation influencing carotenoid accumulation in roots exist within this locus. For example, the allelic polymorphism in the conserved region of the PSY locus reported by Welsch et al. (2010) resulted into change from alanine (in the white-fleshed genotypes) to aspartic acid (in the yellow-fleshed genotypes), different from that found in this study where lysine was substituted to threonine. As an improved version of the whole genome sequence of cassava becomes available in future, detailed biological annotations of these SNPs and their use for systematic cassava biofortification will become more apparent.

The heterozygous nature and large genome size of cassava are important factors to consider when applying NGS technologies to study the crop (Oliveira et al. 2012; Ceballos et al. 2015). Version 6.1 of the cassava genome assembly is approximately 582.25 Mb in length with a total of 33,033 loci containing protein-coding transcripts arranged on 18 chromosomes and 2001 scaffolds that have not been anchored onto specific chromosomes (Goodstein et al. 2012; Prochnik et al. 2012). To achieve reasonable coverage of such a large genome through GBS, a large number of SNPs distributed throughout the genome would be required (Deschamps et al. 2012). In this study, 635,247 SNPs were used to genotype 591 clones. Unprocessed GBS data showed more than 70 % missing data points. After filtering to remove non-informative SNPs, 179,310 markers were used for GWA analysis, which may represent low genome coverage and thus reduced the chances of capturing more polymorphisms with statistical significance. As observed in barley (Pasam et al. 2012), SNPs in association testing commonly explain a low percentage of genetic trait variation (<4 %) compared to QTL effects (as high as 47 %) detected through bi-parental mapping (Abdel-Haleem et al. 2010), which essentially limits the ability of GWAS to detect association signals with small effects. Efforts are on-going, in collaboration with the Next Generation Cassava Breeding Project (www.nextgencassava.org), to develop genomic tools to enhance molecular breeding of cassava. Before long, such resources could make GWAS a more powerful tool for dissecting genetic architecture of important agronomic and quality traits of cassava that have been too complex to exploit through conventional breeding methods (Ceballos et al. 2012, 2015).

In conclusion, this study was one of the pioneering efforts to apply NGS technologies to elucidate genetics of cassava at molecular level and has indeed proved that GWAS is a powerful tool for unravelling genetic factors linked to natural variations for important traits in cassava. The GWAS identified four SNPs strongly linked to carotenoid content in cassava, located within the vicinity of Manes.01G124200.1, a gene known to increase accumulation of TCC in roots. These SNPs linked to TCC could provide a refined insight into genetic architecture of carotenoid content and offer a practical basis for developing cassava varieties with high levels of provitamin A carotenoids. However, further efforts are required to sequence the genomic region containing the significant association signal in order to uncover useful sequence information that can underpin metabolic roles of the associated SNPs with regard to accumulation of TCC in cassava roots. Such information will enable, for example, use of the SNPs as fixed effects when implementing genomic selection for improving cassava for carotenoid content in genetic backgrounds of relevant agronomic traits like root dry matter content, which is consistently negatively correlated with TCC in African cassava germplasm (Njoku et al. 2011, 2015; Esuma et al. 2016). Collectively, these strategies would enhance the chances of developing cassava varieties with increased levels of provitamin A carotenoids, which, when deployed to farmers, could sustainably reduce VAD that is currently prevalent where the crop is a major staple.