Introduction

Cassava breeding efforts throughout the world have had a significant impact on cassava production, particularly in terms of disease tolerance, yield and quality improvements (Legg et al. 2006; Nweke et al. 2002). However, it is anticipated that additional gains in breeding efficiency, which would translate into genetic gain, could be made through the application of advanced molecular breeding technologies, as has been demonstrated in highly researched crops such as maize (Eathington 2005) and rice (Ragot and Lee 2007). Cassava is a difficult crop to breed because of its intrinsic heterozygosity, variable flowering time, low seed set and long breeding cycle, combined with the agricultural goals of diverse product use and growth under harsh environmental conditions, in terms of both biotic and abiotic stresses (Jennings and Iglesias 2001).

While the use of molecular markers in cassava breeding is in its infancy (Okogbenin et al. 2007), they have been applied in plant breeding of other crops in several ways. Marker-assisted selection (MAS) can be applied if markers are either directly associated with a trait (a functional gene with a tractable phenotype) or closely associated with known genes of interest. This is the classical form of MAS, in which either single genes or quantitative trait loci (QTL) may be selected. With the advent of next-generation sequencing (NGS) technologies (454 pyrosequencing, Solexa, and SOLiD), the associated proliferation of single nucleotide polymorphism (SNP) markers, and the possibility of genotyping-by-sequencing (GBS), approaches based on genome-wide marker scans, such as marker-assisted recurrent selection (MARS) or genomic selection, may be applied for more rapid genetic gain in highly quantitative traits such as yield or drought tolerance (Heffner et al. 2011).

Molecular breeding in cassava would confer several advantages, including: (1) more accurate gene-based selection, which would allow, for example, the pyramiding of resistance genes; (2) enhanced genetic gain in quantitative traits through predictive modeling (MARS and genomic selection); (3) reduced breeding population size, which would allow breeders to work on a larger number of genotypes simultaneously; (4) reduced amount of time to product delivery; and (5) preemptive breeding in environments where particular stresses, such as cassava mosaic disease (CMD) or cassava brown streak disease (CBSD), are not currently present, but pose a significant threat.

The successful application of molecular markers does, however, rely on the availability of genomic technologies. In recent years the availability of genomic resources for cassava has increased substantially, most notably through the sequencing of the cassava genome (http://www.phytozome.net/cassava; Prochnik et al. 2011). Here we review the current availability and ongoing applications of molecular markers to cassava research and breeding, highlight new genomic resources, and discuss and speculate on the implications of NGS for molecular breeding strategies for cassava.

Availability of Molecular Markers

Molecular markers, such as random amplified polymorphic DNA (RAPDs) and restriction fragment length polymorphisms (RFLPs), were first used in cassava to study the genetic diversity within the genus Manihot (Marmey et al. 1993). Later, amplified fragment length polymorphisms (AFLPs) were used to understand genetic differentiation in cassava (Elias et al. 2000; Fregene et al. 2000; Roa et al. 1997). The first genetic linkage map of cassava utilized RFLPs, RAPDs and isozymes (Fregene et al. 1997). For the past decade, however, these marker types have largely been replaced by simple sequence repeat (SSR) markers.

Simple Sequence Repeat (SSR) Markers

Microsatellites (Litt and Luty 1989) or SSRs (Tautz et al. 1986) are tandem repeats of two-, three- or four-nucleotide units. They represent points of variation in the genome within a species and are more highly variable when the repeat number is ten or greater (Queller et al. 1993). SSRs are largely co-dominant, multi-allelic, and dispersed throughout the genome and can be multiplexed on semi-automated systems (Varshney et al. 2005). In cassava, several groups have developed a few thousand SSR markers from expressed sequence tags (ESTs) and enriched genomic DNA libraries (Chavarriaga-Aguirre et al. 1998; Mba et al. 2001; Raji et al. 2009; Sraphet et al. 2011; Tangphatsornruang et al. 2008). As the SSR resources were developed independently by different research groups, several SSR primer pairs with different names target the same SSR. To identify duplicates, cassava SSR information was curated using the recently developed cassava genome assembly as a reference.

The Identification of a Non-redundant SSR Dataset

The mapping of polymerase chain reaction (PCR) products onto the genome was used to identify redundancy in the SSR collection. A total of 6,752 cassava SSR primer sequences (forward and reverse primers for 3,367 SSRs) were compiled from various sources (Table 1) and queried using BlastN against the cassava genome sequence (manihot_esculenta_147, 12,977 sequences, 532.5 Mb, 3/31/11; ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Mesculenta/assembly/). From these, 5,402 SSR primers aligned to genomic scaffolds. There were many cases in which the forward and reverse primers of the same SSR were located on either different scaffolds or multiple scaffolds. In 2,079 cases, primers were designated as pairs based on SSR ID and distance between the primers. The average size of the PCR fragments from the paired SSRs was 302 bp. A total of 1,917 SSRs (92.2%) produced a PCR product less than 500 bp, an optimal size for routine PCR testing. Eighty-eight SSRs (4.7%) appeared to produce a PCR product larger than 1 kb, which is indicative of an intron.

Table 1 Simple Sequence Repeat (SSR) markers in cassava

The PCR products of 716 SSRs overlapped with those of other SSRs within a 1 kb range and were thus designated as duplicates. The remaining 1,363 SSRs appeared to be unique SSRs. The 716 duplicates were consolidated into a set of 312 primer pairs. These representatives together with the unique SSRs generated a set of 1,675 SSRs validated by genome comparison. To identify those SSRs that were not present in the cassava genome sequence, all 6,752 SSR primer sequences were blasted against the NCBI M. esculenta nucleotide ESTs (80,631 sequences; 40.8 Mb). This query identified 2,169 SSRs. Of these, 1,526 had already been identified from the curation process using the genome sequence, meaning that a total of 643 SSRs were additionally identified by EST blast analysis. Of these newly validated SSRs, 341 were unique and 302 were duplicates. The duplicates were curated into 130 non-redundant SSRs.
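The duplicate-detection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the curation: it assumes each paired SSR has already been reduced to one amplicon (scaffold, start, end) from its BlastN hits, and it interprets "within a 1 kb range" as amplicons on the same scaffold that overlap or lie no more than 1 kb apart. The function name, record layout and SSR identifiers are hypothetical.

```python
from collections import defaultdict

def cluster_amplicons(amplicons, max_gap=1000):
    """Group SSR amplicons that overlap or lie within max_gap bp on one scaffold.

    amplicons: iterable of (ssr_id, scaffold, start, end) tuples (hypothetical layout).
    Returns a list of clusters (lists of SSR ids). Clusters with more than one
    member are candidate duplicates; single-member clusters are unique SSRs.
    """
    by_scaffold = defaultdict(list)
    for ssr_id, scaffold, start, end in amplicons:
        lo, hi = min(start, end), max(start, end)
        by_scaffold[scaffold].append((lo, hi, ssr_id))

    clusters = []
    for scaffold in sorted(by_scaffold):
        current, current_end = [], None
        # Sweep amplicons in coordinate order, extending the open cluster
        # while the next amplicon starts within max_gap of its right edge.
        for lo, hi, ssr_id in sorted(by_scaffold[scaffold]):
            if current and lo - current_end <= max_gap:
                current.append(ssr_id)
                current_end = max(current_end, hi)
            else:
                if current:
                    clusters.append(current)
                current, current_end = [ssr_id], hi
        if current:
            clusters.append(current)
    return clusters

# Illustrative input only; these are not real cassava SSR coordinates.
example = [
    ("SSR_A", "scaffold_1", 100, 400),
    ("SSR_B", "scaffold_1", 350, 650),
    ("SSR_C", "scaffold_1", 5000, 5300),
    ("SSR_D", "scaffold_2", 120, 420),
]
clusters = cluster_amplicons(example)
duplicates = [c for c in clusters if len(c) > 1]
unique = [c[0] for c in clusters if len(c) == 1]
```

In this toy input, SSR_A and SSR_B collapse into one cluster from which a single representative primer pair would be retained, while SSR_C and SSR_D remain unique.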

In total, we generated a non-redundant set of 2,146 SSRs comprising 1,675 curated from the genome and 471 curated from ESTs (Table 2). The curated, paired SSR sets and associated information can be found at http://bioinformatics.iita.org/cassava_SSRs. Although this set of SSR markers is useful, it is still limited compared to those available for some other crop plants. For example, the Gramene database contains 19,480 SSRs (http://www.gramene.org/markers/microsat/), mainly from Oryza sativa Japonica group. The current paucity of SSRs lowers the genome resolution and limits their use for breeding and genome-wide studies.

Table 2 Simple Sequence Repeat (SSR) curation process in cassava

Curated SSR Integrated with the Genome Assembly

The draft genome and latest SSR-based genetic map (Sraphet et al. 2011) provide an opportunity to integrate the curated SSRs into the genome and genetic linkage maps. Using BlastN, the 1,675 SSRs validated by genome sequence were located on 686 genome scaffolds, which totaled 256 Mb. The current version of the cassava genome assembly consists of 12,977 scaffolds spanning 533 Mb. This implies that about 50% of the draft cassava genome, in terms of length of SSR-containing scaffolds, is linked to the curated SSR sets. Scaffold information for 1,430 putative SNPs of the cassava genome was obtained from recent SNP research (Ferguson et al. 2011). These and the curated SSRs were analyzed in relation to existing SSR-based genetic linkage maps, number of scaffolds and genome coverage. A total of 1,298 SSRs were found on the genetic linkage map, representing 253 scaffolds and 137 Mb. Results are presented in Table 3. It is anticipated that individual tracks of SSRs and SNPs will be added to the genome sequence in the online Phytozome database.

Table 3 Integrated Cassava Simple Sequence Repeat (SSR) information

DArT Genotyping of Cassava and Its Wild Relatives

Prior to the development of SNP markers, marker density and the high cost per data point limited the application of molecular markers to cassava genetic resource conservation and breeding. Cassava geneticists urgently needed a set of molecular markers that more completely covered the genome, that were based on a technology platform that was easily accessible, and that could be scored readily in new germplasm by any member of the global cassava research community. In the early 2000s, a novel molecular marker technique based on microarray DNA hybridization was developed. This technique can genotype hundreds of polymorphisms across a large number of individual plants. With the proper bioinformatic analytical tools, Diversity Array Technology (DArT) can be used to characterize several hundred to thousands of polymorphisms in a timely, cost-effective manner. The first cassava DArT array developed had nearly 1,000 polymorphic clones with 99.8% reproducibility (Xia et al. 2005), offering a high-throughput marker screening system at a low cost.

DArT locus informativeness was tested against the well-studied cassava SSR marker system in 436 cassava accessions at Centro Internacional de Agricultura Tropical (CIAT). It was concluded that, even though SSRs sampled significantly fewer loci per reaction than the DArT technology, SSRs provided greater differentiation and more effectively recovered the patterns of genetic diversification in the genus. These results indicated that DArT markers have a limited application for cassava germplasm characterization (Becerra Lopez-Lavalle pers. comm.).

Single Nucleotide Polymorphism Markers

SNP markers and small insertions and deletions (indels) represent the most frequent form of naturally occurring genetic variation within populations (Cho et al. 1999). The identification of a high density of SNPs in cassava would dramatically facilitate progress in cassava genomics and breeding. SNPs are generally biallelic (two alleles at a locus) and in this sense are individually less informative than SSRs, which are generally multi-allelic (many alleles at a locus; Syvänen 2001). This drawback is compensated for by the abundance of SNPs and their suitability for ultra-high-throughput genotyping techniques (Appleby et al. 2009; Rafalski 2002). The utilization of multi-SNP haplotypes can offset the relatively low information content of single SNP loci (Brumfield et al. 2003). An early study of sequence variation in cassava identified 136 SNPs from EST sequences and 50 SNPs from bacterial artificial chromosome (BAC) end sequences (Lopez et al. 2005). Kawuki et al. (2009) studied sequence diversity in nine genes and identified 26 SNPs. Sakurai et al. (2007) reported the identification of 2,356 SNPs but gave no details. The frequency of SNPs was found to be one every 53 bp in non-coding regions and one every 181 bp in coding regions, with an average of one SNP every 121 bp (Kawuki et al. 2009). Lopez et al. (2005) found on average a frequency of one SNP every 62 bp. This high frequency of SNPs is consistent with findings from other species, such as grapevine (Salmaso et al. 2005) and maize (Ching et al. 2002).
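The difference in per-locus information content between biallelic SNPs, multi-allelic SSRs and multi-SNP haplotypes can be illustrated with expected heterozygosity (gene diversity), H = 1 - sum(p_i^2). The sketch below is illustrative only: the allele frequencies are invented, and the haplotype example assumes two SNPs in linkage equilibrium so that haplotype frequencies are simple products of allele frequencies.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity (gene diversity) at one locus: 1 - sum(p_i^2)."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)

# A biallelic SNP reaches its maximum at equal allele frequencies.
snp = expected_heterozygosity([0.5, 0.5])

# A hypothetical SSR with five equally frequent alleles.
ssr = expected_heterozygosity([0.2] * 5)

# Two such SNPs treated as one haplotype locus (assumes linkage equilibrium,
# giving four haplotypes at frequency 0.25 each).
haplotype = expected_heterozygosity([0.25] * 4)
```

Under these assumptions a single SNP cannot exceed H = 0.5, whereas the five-allele SSR reaches about 0.8 and the two-SNP haplotype 0.75, which is the sense in which haplotypes recover some of the information lost at individual biallelic loci.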

The recent identification and validation of 1,190 SNP markers in cassava has been reported from a total of 2,954 putative EST-derived SNPs (Ferguson et al. 2011). These SNPs have been located on scaffolds of the cassava genome sequence (v.4.1). The University of Maryland’s Cassava Genome Database provides 384 putative SNPs derived from genes, and 371 putative SNPs derived from the cassava physical map. A putative SNP is a nucleotide variant that has been identified from sequence data, but has not been validated and may be a result of sequencing error. An explosion in the identification of SNPs in cassava is anticipated in the near future with the plummeting cost of sequencing, which has made whole genome re-sequencing projects feasible.

A large number of known SNPs will be available for genotyping using custom-made SNP arrays. Ferguson et al. (2011) used Illumina GoldenGate (Illumina Inc., San Diego, CA) technology to assay 1,536 putative SNPs. However, this approach is relatively inflexible in terms of the number of markers and genotypes assayed and, for this reason, may be more suited to genomic or association mapping approaches. Recently, the Generation Challenge Program (GCP) converted 1,740 SNPs in cassava for use on the KASPar platform (LGC). This system is extremely flexible in terms of the combination of numbers of markers and samples that can be genotyped and, therefore, is particularly suitable for molecular breeding applications, such as MAS or MARS. Converted KASPar markers are available through the GCP IBM marker services (http://marlow.iplantcollaborative.org/marker-service). We anticipate that in the near-to-medium term most common breeding applications will rely on low-cost SNPs in relatively low-density formats such as KASPar. The de novo discovery of SNPs through reduced representation GBS is currently under development for cassava in a number of laboratories. Reduced representation GBS is based on reducing genome complexity with methylation-sensitive restriction enzymes (REs) and barcoded oligonucleotide adaptors, followed by high-throughput sequencing. The GBS procedure mapped roughly 200,000 and 25,000 sequence tags in maize (IBM) and barley (Oregon Wolfe Barley), respectively (Elshire et al. 2011).

Other Genomic Resources

ESTs are partial sequences (200–800 bp) of expressed genes randomly picked from a complementary DNA (cDNA) library. EST data combined with full-length cDNAs form an important resource for allele mining and marker development. To date, 80,631 cassava ESTs have been deposited in GenBank (Ferguson et al. 2011; Lokko et al. 2007; Lopez et al. 2004; Sakurai et al. 2007). A subset of nearly 60,000 of these, filtered on quality, has been compiled into the HarvEST:Cassava database (http://harvest.ucr.edu). RIKEN Institute, Japan, in collaboration with CIAT, has generated two further EST libraries and reports Sanger, 454 and Illumina paired-end sequences (Utsumi et al. 2011). To date no RNAseq next-generation sequencing data are available, although a number of projects using this technology are reported to be in progress.

A number of microarrays have been developed for cassava, although their application appears to be limited. A Euphorbiaceae microarray was developed by James Anderson and consisted of a unigene set of 19,015 cDNAs from leafy spurge (Anderson et al. 2007). A cassava unigene microarray was produced by Valerie Verdier (Lopez et al. 2004). A 60-mer oligonucleotide microarray representing 20,840 cassava genes was used to study storage root formation and stress response (Yang et al. 2011). A similar array was developed with approximately 11,000 probes for transcriptome analysis (Ingelbrecht et al. 2008).

The Application of Molecular Markers to Diversity Assessments

Several studies have characterized the genetic diversity in cassava gene pools with the aim of aiding genetic resource conservation and breeding programs. One of the first attempts to use molecular markers on a global scale examined genetic diversity, differentiation and potential heterotic groups in cassava (Fregene et al. 2003). Genetic diversity was assessed in 283 cassava landraces from Tanzania (163), Nigeria (29) and the Neotropics (Brazil, Colombia, Peru, Venezuela, Guatemala, Mexico and Argentina; 91) using 67 marker loci. The high levels of genetic diversity found in all countries were unexpected, considering the probable center of domestication along the southern rim of the Amazon Basin and the later expansion into other regions of the Neotropics, Africa and Asia (Olsen and Schaal 2001), which would have been expected to produce a founder effect of reduced diversity and an increase in genetic differentiation. The authors attributed the observed high levels of diversity to spontaneous recombination in farmers’ fields. Levels of molecular marker diversity in cassava landraces from Africa and several Neotropical countries were also found to be similar to those reported in Brazilian landraces (Beeching et al. 1993; Fregene et al. 2000).

Fregene et al. (2003) observed both a separation between Neotropical and African landraces and a more pronounced substructure in the African accessions as compared to the Neotropical landraces. Landraces from Guatemala were particularly highly differentiated from other regions. This general structure agrees with a previous AFLP marker study of 29 African and 11 Neotropical landraces (Fregene et al. 2000).

A larger diversity study of approximately 2,300 accessions from the International Institute of Tropical Agriculture (IITA), Empresa Brasileira de Pesquisa Agropecuária (EMBRAPA) and CIAT gene banks using 30 SSR markers was conducted (Hurtado et al. 2008). While this is not a global representation of cassava diversity, as only 2.4% of the varieties were from southern, eastern or central Africa (http://gcpcr.grinfo.net/), the study did observe a separation of accessions from Africa and Latin America/Asia. It also revealed a separation of some Nigerian accessions from the rest of Africa, and some Guatemalan accessions from other Latin American samples.

Until 2010, only relatively small-scale genetic diversity assessments of cassava in southern, eastern and central (SEC) Africa had been undertaken using a range of molecular markers (Benesi 2005; Fregene et al. 2000; Kizito et al. 2005; Zacarias et al. 2004). Using 26 SSR markers, Kawuki (2009) examined the nature and extent of genetic variation within a group of 1,401 cassava varieties from seven countries in SEC Africa: Tanzania (270 genotypes), Uganda (268), Kenya (234), Rwanda (184), Democratic Republic of Congo (DRC; 177), Madagascar (186) and Mozambique (82). This study revealed somewhat uniformly high levels of diversity across the region. It also revealed a subtle diversity sub-structure, with farmer varieties from southern and eastern coastal areas (Madagascar, Mozambique and Tanzania) clustering together and germplasm from the DRC, Uganda and Rwanda clustering together. This could reflect two routes of introduction of cassava into Africa, one through West Africa during the 1500s (Carter et al. 1992; Jones 1959) and the other through the east African coastline in the 1750s, from where it was then introduced to Madagascar and to mainland Africa (Jones 1959; Langlands 1966).

Recently, 1,190 SNP markers were used to genotype 53 cassava varieties from the Americas, Asia, West Africa and SEC Africa (Ferguson et al. 2011). This study shows a similar structure in the germplasm as that observed by Fregene et al. (2003) using SSR markers. Although germplasm tended to cluster together on a regional basis, the Americas, West Africa and SEC Africa, other groups contained germplasm of mixed origin. A slightly larger genetic diversity was found in the Americas (0.3488) compared to that in Africa (0.3357), both with a sample size of 22. These relatively uniform levels of diversity were consistent with those observed by Fregene et al. (2003). Mean observed heterozygosity from SNP markers was lower than that observed by Fregene et al. (2003). This was attributed to the biallelic nature of SNPs as opposed to the multi-allelic nature of SSRs.

Molecular markers have been applied to a number of other purposes related to diversity in cassava. Chavarriaga-Aguirre et al. (1999) used SSR markers to identify duplicates in the CIAT core collection. Similarly, Ferguson et al. (2011) demonstrated the use of SNPs to detect duplicates in the IITA Genebank collection. SNP and SSR markers were used to make inferences about cassava’s putative wild progenitor and to analyze variation in its natural populations (Olsen and Schaal 2001; Olsen 2004).

Another obvious, but so far unexploited, use of genetic diversity data is the delineation of heterotic groups in cassava to maximize heterosis for hybrid breeding. Heterosis, defined as increased performance of hybrid progeny compared to their parents, is known to be a function of genetic divergence among cultivars and is more highly expressed in outbreeding (heterozygous) species than in inbreeding ones (Becker 1993). There is an acute need to utilize the existing molecular diversity and genetic differentiation data for classifying and delineating heterotic groups in cassava, as was done at the turn of the last century in maize (Crow 1998). Assigning germplasm to genetically divergent heterotic groups is fundamental for optimum exploitation of heterosis through crossing of complementary lines or populations. Traditionally, pedigree analysis, agro-morphological differentiation, measurement of heterosis and combining ability analysis in diallel crosses have been employed to establish heterotic patterns and groups. This can be quite demanding, considering the large number of pair-wise crosses required. Molecular markers such as SSRs and SNPs can be used to determine genetic relatedness and thus assist in the selection of parents required for experimental studies to identify heterotic groups. Theoretical and empirical research in maize and other crops has identified a linear association between marker-based genetic distance and heterosis (Reif et al. 2003). Because cassava is clonally propagated, heterosis can be effectively exploited: breeders need to create a superior genotype only once and can then maintain it indefinitely.
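As a minimal sketch of how marker data could be used to rank candidate parents by genetic relatedness, the example below computes a simple allele-sharing distance from SNP genotypes. The choice of distance measure, the 0/1/2 genotype coding, the clone names and the genotype values are all assumptions made for illustration; they are not taken from the studies cited above, which may have used other distance measures.

```python
from itertools import combinations

def allele_sharing_distance(g1, g2):
    """Allele-sharing distance between two clones, scaled to the range 0 to 1.

    Genotypes are coded as 0, 1 or 2 copies of one allele per SNP;
    None marks a missing call and that locus is skipped.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no shared non-missing loci")
    return sum(abs(a - b) for a, b in pairs) / (2 * len(pairs))

# Hypothetical genotypes at five SNP loci.
genotypes = {
    "clone_1": [0, 0, 2, 1, 2],
    "clone_2": [0, 1, 2, 1, 2],
    "clone_3": [2, 2, 0, 1, 0],
}

# Distance for every pair of clones.
distances = {
    (a, b): allele_sharing_distance(genotypes[a], genotypes[b])
    for a, b in combinations(sorted(genotypes), 2)
}

# The most divergent pair is a candidate for test crosses between putative groups.
most_divergent = max(distances, key=distances.get)
```

A distance matrix of this kind only narrows the set of crosses worth making; heterotic patterns would still need to be confirmed in field evaluation of the progeny.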

The Application of Markers to the Discovery of Marker-Trait Associations

Genetic Linkage Maps

Genetic linkage maps form the basis of many approaches to the discovery of both major genes and QTLs (Lander and Botstein 1989) that in turn can be applied to conventional MAS. A genetic map also offers a framework for carrying out evolutionary and comparative genomic studies (Ahn and Tanksley 1993) and contributes to understanding the organization and dynamics of genomes, such as landscapes of linkage disequilibrium (LD) (Flint-Garcia et al. 2003; Sewell et al. 1999).

Due to cassava’s highly heterozygous nature, inbreeding depression during selfing (Rojas et al. 2011), long growing cycle and low seed number per cross, genetic linkage mapping has traditionally been carried out in outbreeder full-sib (F1) families that are genetically fixed through vegetative propagation. A similar approach has been used in other perennial and clonally propagated crops such as tea (Hackett et al. 2000), rhodesgrass (Ubi et al. 2004) and banana (Hippolyte et al. 2010). Linkage maps are calculated using a double pseudo-testcross mapping strategy to create separate parental linkage maps in order to account for two independent parental meioses before simultaneous analysis (Grattapaglia and Sederoff 1994). Table 4 summarizes the salient characteristics (population type, sample size, number and types of mapped markers, number of linkage groups, map coverage and average distance between adjacent markers) of available cassava genetic maps. The first cassava genetic map was generated from a full-sib cross and consisted of 168 markers, mainly RFLPs and RAPDs, as well as a few SSRs and isoenzymes (Fregene et al. 1997). This linkage map was based on an F1 segregating population (known as the K family) of two geographically divergent parents (♀TMS30572 x ♂CM2177-2). The female parent TMS30572, from Africa and with resistance to CMD, was derived by introgressing M. glaziovii into M. esculenta (Hahn et al. 1980), while the male parent was an elite CIAT cultivar with no resistance to CMD but a high photosynthetic rate. Currently, CIAT and Diversity Arrays Technology Pty Ltd are collating and analyzing DArT marker data for 150 F1 progeny of the K family with the aim of saturating the K family genetic map. This should enable scaffolds of the current cassava genome sequence to be anchored on the K family genetic map (Becerra Lopez-Lavalle pers. comm.).
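The marker-sorting logic behind a double pseudo-testcross can be sketched as below. This is a simplified illustration for codominant markers only, assuming parental genotypes are written as two-letter allele strings; the function name and labels are hypothetical, and real analyses also test the observed progeny segregation ratios before assigning a marker to a map.

```python
def classify_marker(female, male):
    """Assign a codominant marker to a parental map from the parental genotypes.

    female, male: two-character genotype strings such as "AB" or "AA".
    A marker heterozygous in only one parent segregates 1:1 in the F1 and
    informs that parent's map; a marker heterozygous in both parents
    segregates in both meioses and can serve as a bridge between the maps.
    """
    female_het = len(set(female)) > 1
    male_het = len(set(male)) > 1
    if female_het and not male_het:
        return "female map"
    if male_het and not female_het:
        return "male map"
    if female_het and male_het:
        return "both maps"
    return "uninformative"

# Hypothetical parental genotypes for four markers.
markers = {
    "m1": ("AB", "AA"),
    "m2": ("AA", "AB"),
    "m3": ("AB", "AB"),
    "m4": ("AA", "BB"),
}
assignment = {name: classify_marker(f, m) for name, (f, m) in markers.items()}
```

Marker m4 is uninformative because all F1 progeny are AB: neither parent is heterozygous, so no segregation is observed in the full-sib family.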

Table 4 Summary of published genetic linkage maps of cassava

Nearly a decade later, a second genetic map, SSR-based and derived mainly from an F2 population descended from a single F1 plant, was developed (Okogbenin et al. 2006). More recently, several maps with between 137 and 510 markers comprising AFLPs and SSRs (both genomic and EST-derived) have been published (Chen et al. 2010; Cortés et al. 2002; Kunkeaw et al. 2010; Kunkeaw et al. 2011; Marín Colorado et al. 2009; Sraphet et al. 2011; Whankaew et al. 2011). Most of these maps were developed from two specific bi-parental crosses, Huay Bong 60 × Hunatee and TMS 30572 × CM 2177–2, which together account for five of the seven current cassava genetic maps. In spite of this common map derivation and the fact that many SSRs are present in multiple maps, cassava lacks a unified consensus map. A number of SSR- and SNP-based genetic maps of cassava are under development, which should increase marker map density.

Mapping of Quantitative Trait Loci (QTL)

Molecular markers have been widely used for mapping QTL underlying agronomic traits in many crops. In cassava, QTL controlling cyanogenic glucoside accumulation and dry matter content (Kizito et al. 2007; Whankaew et al. 2011), plant architecture and productivity (Boonchanawiwat et al. 2011; Okogbenin et al. 2008; Okogbenin and Fregene 2003), bacterial blight (Jorge et al. 2000; Jorge et al. 2001; Lopez et al. 2007; Wydra et al. 2004), wound response in post-harvest physiological deterioration (PPD) (Cortés et al. 2002), carotene levels (Marín Colorado et al. 2009), CMD (Akano et al. 2002) and CBSD (Kulembeka 2011) have been reported. Table 5 provides a summary of some published QTL studies in cassava, including target traits, parents and population type, and markers used.

Table 5 A summary of QTL studies in cassava

Unfortunately, in many cases a small sample size and a limited number of markers have led to poor resolution of QTL positions. For example, of the nine QTL related to productivity component traits identified by Okogbenin et al. (2008), seven had QTL intervals of between 16 and 44 centimorgans (cM), despite the fact that this study employed the largest population size of all the studies, 268 individuals. In addition, as described above, most studies were conducted on the ‘K family’ or derivatives thereof (Table 5). This bi-parental population represents an extremely small proportion of the global cassava diversity, and the identified QTL may have little relevance to QTL segregating in other populations, thus limiting the scope of inference and the application in marker-assisted selection. A more comprehensive dissection of genetic architecture requires the development of multiple populations that represent a larger sample of the available genetic variation in the species (Holland 2007). A feasible alternative to the creation of multiple bi-parental populations is the adoption of newer QTL dissection approaches, such as association mapping (Buckler and Thornsberry 2002).

Genome-Wide Association Mapping

Association mapping or LD mapping has been proposed as an efficient way to determine the genetic basis of complex traits (Abdurakhmonov and Abdukarimov 2008). Association mapping relies on germplasm samples and does not require the development of bi-parental populations. Compared to conventional linkage mapping, association mapping takes advantage of historical LD between genes coding for a trait and closely linked markers for mapping. In comparison, the classical F1-based QTL mapping population is characterized by a small number of recombination events per chromosome (Abdurakhmonov and Abdukarimov 2008; Nordborg and Tavaré 2002; Stich et al. 2006). Association mapping thus has the potential to provide greater map resolution. In addition, the use of diverse germplasm in association mapping enables many alleles and traits to be evaluated simultaneously (Stich et al. 2006).

The applicability and resolution of association mapping and other modern breeding approaches such as genomic selection depend on the extent and structure of LD within the population under consideration. LD is the nonrandom association of alleles at different loci and is affected by the breeding system of the species (selfing versus outcrossing), population structure and genome-wide recombination patterns (Flint-Garcia et al. 2003). Rapid decay in LD, a common feature in outbreeding crops (and cassava is not expected to be an exception), means that a substantially greater number of genetic markers are needed to detect linkage between a marker and a causal locus (Yu and Buckler 2006). To our knowledge, no information is available on genome-wide or intragenic LD in cassava germplasm. This information is required to set the stage for genome-wide scans that will uncover associations between molecular markers and important agronomic traits.
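For readers less familiar with LD statistics, the nonrandom association of alleles at two biallelic loci is commonly summarized by the coefficient D = p_AB - p_A p_B and its normalized form r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). The sketch below computes both from haplotype and allele frequencies. The frequencies are invented, and the example assumes that haplotype phase is known, which in a heterozygous outbreeder such as cassava would itself have to be inferred.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2.
    p_a, p_b: frequencies of allele A at locus 1 and allele B at locus 2.
    """
    for p in (p_a, p_b):
        if not 0.0 < p < 1.0:
            raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

# Random association: the haplotype frequency equals the product of allele frequencies.
d_eq, r2_eq = linkage_disequilibrium(0.25, 0.5, 0.5)

# Complete association: allele A always occurs together with allele B.
d_ld, r2_ld = linkage_disequilibrium(0.5, 0.5, 0.5)
```

Plotting r^2 for marker pairs against their physical or genetic distance is a common way to describe how quickly LD decays, which in turn indicates the marker density that a genome-wide scan would require.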

Gene Discovery

Cassava production in developing countries is beset by a multitude of pests and diseases (Ceballos et al. 2004; Dixon et al. 2003). It is important that research identifies sources of resistance or tolerance to these continuously evolving stresses and devises strategies to efficiently deploy them in a range of germplasm. Genome-wide surveys have resulted in the identification of about 150 resistance gene analogues (RGAs) in Arabidopsis (Meyers et al. 2003; Tan et al. 2007), about 500 in rice (Zhou et al. 2004) and about 400 in poplar (Tuskan et al. 2006). In the pre-genome sequence era, degenerate primers were successfully used to isolate RGAs, resulting in thousands of NBS-LRR (nucleotide binding site-leucine rich repeat) like partial sequences (Bai et al. 2002; Budak et al. 2006; Chen et al. 2007; Gedil et al. 2001; van der Linden et al. 2004) and prompting the formation of dedicated databases (Sanseverino et al. 2010). Candidate genes retrieved by this method were successfully used to develop and map molecular markers co-segregating with disease resistance traits (Calenge et al. 2005a; Calenge et al. 2005b; Moroldo et al. 2008; Qiu et al. 2007) and, in many cases, were found to map close to major resistance gene QTL (Gebhardt et al. 2006; Gebhardt and Valkonen 2001; Speulman et al. 1998).

A previous study in cassava reported the isolation of 12 classes of resistance gene candidates (RGCs), of which two full-length protein coding sequences were identified and mapped on the framework cassava linkage map (Lopez et al. 2003). Using a similar comparative approach, a study carried out at IITA led to the isolation and characterization of over 500 partial sequences of NBS-LRR-type R genes in cassava and its relatives (M. glaziovii, M. brachyandra, M. epruinosa, M. tripartita, and castor bean, Ricinus communis) in the Euphorbiaceae family (Gedil et al. submitted). More than half (353 sequences or 64%) of the total sequences had open reading frames (ORFs) uninterrupted by stop codons, whereas the rest of the sequences did not have an intact ORF, which suggests that they may derive from pseudogenes that are not expressed and have no functional role. Both TIR (toll interleukin 1 receptor) and non-TIR sub-families were observed by phylogenetic analysis. Multiple sequence alignment (MSA) revealed that the newly identified sequences showed similarity to domains/motifs of known R genes, identifying them as candidate R genes. The candidate sequences matched many homologous R genes in the draft cassava genome with high sequence similarity. This finding furnishes fundamental knowledge about RGAs in cassava and wild Manihot species. It is anticipated that understanding the structure, localization, function, variation, and evolution of resistance genes, in combination with other gene and/or genetic mapping approaches, will enable the development of functional gene-targeted markers for use in molecular resistance breeding and of novel strategies for anticipatory and durable disease control (Lawson et al. 2010).
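The criterion of an open reading frame uninterrupted by stop codons can be illustrated with a short check on a partial nucleotide sequence. This is a minimal sketch, not the procedure used in the study cited above: it scans only the three forward frames, ignores the reverse strand and any requirement for a start codon (reasonable for partial sequences, but an assumption here), and the example sequences are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_frames(seq):
    """Return the forward reading frames (0, 1, 2) that contain no stop codon."""
    seq = seq.upper().replace("U", "T")
    frames = []
    for frame in range(3):
        # Walk complete codons only; a trailing partial codon is ignored.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames

# Invented example sequences.
interrupted_in_frame_0 = open_frames("ATGGCCTAAGGC")  # TAA falls in frame 0
fully_open = open_frames("ATGGCCGGC")                 # no stop in any frame
no_open_frame = open_frames("TAAATAGATGAA")           # a stop in every frame
```

A sequence for which this function returns an empty list has a stop codon in every forward frame and would be a candidate pseudogene fragment, subject to checking the reverse strand and sequencing quality.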

The Application of Markers to Cassava Breeding

Marker-Assisted Selection (MAS)

Breeding a new variety of cassava usually takes 10 years, due to its long growth cycle (12–18 months). MAS can dramatically increase the precision of selection, leading to more rapid genetic gain with fewer cycles of phenotypic evaluation and thus reducing the time needed for varietal development. Under the current molecular breeding scheme used for cassava, varieties could be tested for release in six years. Additionally, use of MAS at the seedling stage dramatically reduces population sizes, making breeding more economical and allowing breeders to work on a larger number of populations.

The only known applications of molecular breeding in cassava are selection for CMD and cassava green mite (CGM) resistance at CIAT and in national breeding programs. MAS has rapidly facilitated breeding for CMD2-mediated resistance in Latin America (in the absence of the pathogen) and in Africa, where the disease is most prevalent (Blair et al. 2007). Other field evaluations have indicated that the markers RME1 and NS158 are excellent predictors of CMD resistance (Okogbenin et al. 2007). CMD resistance was introgressed into improved elite CIAT lines using markers. These are now referred to as the CR-series (CR families). Two markers (NS1009 and NS346) associated with CGM resistance have been used in MAS (Okogbenin, pers. comm.). Another set of families, combining CMD and CGM resistance, was developed using markers and denoted the AR series. Both AR and CR genotypes of the same set were shared and distributed in vitro to African National Agricultural Research Services (NARS) through activities supported under the GCP. This has been a significant achievement, given the susceptibility of Latin American germplasm to CMD. Of the more than 1,000 genotypes introduced into African NARS, several have been integrated into multi-stage breeding activities, some to the point of varietal selection by farmers in Nigeria, Tanzania, Ghana and Uganda. One of the Latin American-derived cassava varieties, CR41-10 (UMUCASS 33), was released in Nigeria in 2010 and represents the first Latin American variety officially released in Africa. The markers are being used to transfer CMD resistance into desirable genetic backgrounds of East African farmer-preferred varieties with CBSD tolerance in order to combine resistance to both viral diseases.

The threat of CMD requires improved resistance and enhanced durability and has prompted further screening for new sources of CMD resistance in Nigeria. Molecular marker analysis has identified a new source of CMD resistance (CMD3) in TMS 97/2205, an IITA-developed variety found to show high CMD resistance in different ecologies that have high to very high disease pressure in Nigeria (Egesi et al. 2007). The near immunity of this variety to CMD has been attributed to the combined effects of the CMD2 and CMD3 loci. Gene pyramiding, the process of combining several genes together into a single genotype (Collard and Mackill 2008), has been initiated for CMD resistance breeding using both CMD2 and CMD3 genes for enhanced durability and stability of CMD resistance.

Results from MAS-bred CGM genotypes indicate variation in response to the pest. Progenies selected with the markers for CGM resistance tended to show good resistance to the pest in East Africa in contrast to the moderate tolerance observed for CGM in West Africa (Okogbenin, unpublished). The phenotypic differences between African sub-regions could be due to variation in CGM pressure, which was higher at the Umudike test site in Nigeria than in Mtwara and Chambeze, Tanzania, and Namulonge, Uganda.

Gene Mining of Wild Relatives for Gene Pool Development, using MAS

Wild Manihot germplasm offers a wealth of useful genes for agronomic cassava. Key target breeding traits have been discovered in wild accessions of cassava, including high levels of protein in M. esculenta subsp. flabellifolia, M. peruviana and M. tristis (CIAT 2004), low amylose content starch (3–5%) or waxy starch in M. crassisepala and M. chloristicta, and delayed PPD in an interspecific hybrid between cassava and M. walkerae (Bertram 1993). Moderate to high levels of resistance to CGM, whiteflies and the cassava mealybug have been found in interspecific hybrids of M. esculenta subsp. flabellifolia. The use of wild species in breeding programs is restricted by linkage drag, which makes pre-breeding activities necessary. However, the use of molecular markers to introgress a single target region of the genome can save two to four backcross generations (Frisch et al. 1999). It is possible that the genetic potential of wild relatives can be released by an advanced backcross QTL mapping scheme (ABC-QTL) in which markers are used for both foreground and background selection (Tanksley and McCouch 1997). ABC-QTL has been used at CIAT to introgress genes for protein content, waxy starch and delayed PPD. Genotypes carrying the QTL of interest and a minimum of donor parent genome were selected and used for generating advanced backcross populations (Blair et al. 2007).
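The combination of foreground and background selection mentioned above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the scheme used at CIAT: markers are assumed to be codominant with clearly distinguishable recurrent ("R") and donor ("D") alleles, and all plant names, marker names and function names are hypothetical. The helper `expected_recurrent_fraction` gives the standard expectation, without marker selection, that a BCn generation carries on average 1 - (1/2)^(n+1) of the recurrent parent genome.

```python
# Illustrative sketch only: plant, marker and function names are hypothetical.

def expected_recurrent_fraction(n_backcrosses: int) -> float:
    """Average recurrent-parent genome share in BCn without marker selection."""
    return 1.0 - 0.5 ** (n_backcrosses + 1)


def recurrent_share(genotype: dict, target: str) -> float:
    """Share of recurrent-parent alleles at background (non-target) markers."""
    background = [marker for marker in genotype if marker != target]
    recurrent_alleles = sum(genotype[marker].count("R") for marker in background)
    return recurrent_alleles / (2 * len(background))


def select_backcross(plants: dict, target: str, n_keep: int) -> list:
    """Foreground step: keep carriers of the donor allele at the target marker.
    Background step: rank carriers by recurrent-parent share, best first."""
    carriers = [name for name, g in plants.items() if "D" in g[target]]
    carriers.sort(key=lambda name: recurrent_share(plants[name], target), reverse=True)
    return carriers[:n_keep]


plants = {
    "P1": {"QTL": "RD", "m1": "RR", "m2": "RD", "m3": "RD"},
    "P2": {"QTL": "RR", "m1": "RR", "m2": "RR", "m3": "RR"},  # lost the donor allele
    "P3": {"QTL": "RD", "m1": "RR", "m2": "RR", "m3": "RD"},
}
print(select_backcross(plants, target="QTL", n_keep=1))
print(expected_recurrent_fraction(1), expected_recurrent_fraction(3))
```

Choosing carriers with an above-average recurrent-parent share in each generation is what allows marker-assisted backcrossing to reach a given level of background recovery in fewer generations than the unselected expectation.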

In the case of introgressing a naturally occurring mutant of the granule-bound starch synthase I (GBSSI) gene found in wild relatives, a highly targeted approach was adopted. Sequencing of the glycosyl transferase region of the GBSSI gene from wild relatives and two cassava accessions resulted in the identification of four SNPs that differentiated wild accessions from cassava. These were used to develop allele-specific molecular markers for MAS (Blair et al. 2007). Such allele-specific markers may be used to select genotypes that harbor the recessive mutant gene for future selfing to recover waxy starch. This approach represents an innovative molecular tool to accelerate the introgression of favorable alleles from wild relatives into cassava.

Estimation of Heterozygosity and Development of Partial Inbreds

Cassava genotypes are heterozygous and show extensive segregation in F1 breeding populations, making breeding for complex traits very uncertain. Inbred lines are preferred as parents because they carry less genetic load and are free of the confounding effect of dominant alleles masking recessive ones. At CIAT, efforts are underway to reduce high segregation in the seedling nurseries and to minimize MAS cost at the early stages of the breeding scheme. To do this, favorable alleles were fixed at six target marker loci for both CMD and CGM. The breeding value (for the trait) of the resulting homozygous S1 genotypes doubles, provided the assumption of heterozygosity for the trait in the elite S0 genotype holds true. Castro et al. (unpublished) used markers to show that one generation of selfing (S0 to S1) reduced heterozygosity by 50% overall, as expected from inbreeding. Selection for inbreeding tolerance is biased by the differences in homozygosity levels of segregating partially inbred genotypes. Markers can be used to estimate heterozygosity in selfed lines to permit co-variance correction in the selection of phenotypically vigorous genotypes. Molecular markers are presently being used to assess homozygosity of selfed populations in the development of partial inbred lines in Africa. The markers will be used to determine regions in the genome that are particularly related to the expression of heterosis and to measure genetic distances among inbred lines so that crosses can be made with higher probabilities of success (Ceballos et al. 2004). It is hoped that this will improve the prospects of developing seed technology for cassava involving the use of seeds as propagules and of hybrid development.
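As a minimal sketch of how marker data can be turned into a heterozygosity estimate, the Python example below computes the proportion of heterozygous loci for a genotype and the theoretical expectation after selfing, where each generation of selfing halves heterozygosity on average. The function names and the marker calls are hypothetical and are not taken from the CIAT study.

```python
# Illustrative sketch only: function names and genotype calls are hypothetical.

def observed_heterozygosity(genotypes: list) -> float:
    """Proportion of scored marker loci that are heterozygous.
    Each locus is an (allele1, allele2) tuple; None marks a missing call."""
    scored = [g for g in genotypes if g is not None]
    if not scored:
        raise ValueError("no scored loci")
    return sum(a != b for a, b in scored) / len(scored)


def expected_after_selfing(h0: float, generations: int) -> float:
    """Expected heterozygosity after selfing: halved in each generation."""
    return h0 * 0.5 ** generations


# Hypothetical calls for an S0 parent and one of its S1 progeny.
s0 = [("A", "G"), ("C", "T"), ("A", "A"), ("G", "T")]
s1 = [("A", "A"), ("C", "T"), ("A", "A"), ("G", "G"), None]
print(observed_heterozygosity(s0), observed_heterozygosity(s1))
print(expected_after_selfing(observed_heterozygosity(s0), 1))
```

Individual S1 plants scatter around the expected value, which is why a per-plant marker estimate is useful when comparing the vigor of partially inbred genotypes.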

Marker Assisted Recurrent Selection (MARS) for Complex Traits

MAS is useful for pyramiding genes of relatively large effect, such as disease resistance genes (Jia et al. 2002; Komori et al. 2003; Murai et al. 2001). However, most agronomic traits are quantitative in nature, and their manipulation has been challenging because of interacting factors such as epistasis, pleiotropy and genotype-by-environment interaction. Breeding for complex traits is expensive due to the need for highly replicated phenotyping trials over several environments. This is driving the quest for a marker-assisted breeding (MAB) approach that increases the precision of selection and reduces the requirement for phenotyping.

MARS is a MAB strategy for forward breeding of genes and QTLs for relatively complex traits (Crosbie et al. 2006; Eathington 2005; Ragot et al. 2000; Ribaut and Betran 1999). Here, QTL mapping is performed in the F1 from a biparental cross in which both parents contribute favorable alleles with the ideal genotype being a mosaic of beneficial alleles from both parents (Ragot and Lee 2007). Several generations of crossing and genotypic selection are done for each phenotyping trial. MARS is essentially a genotype construction process that leads to an increase in the frequency of beneficial alleles and to the development of individuals having the best haplotype combination at selected loci in the genome. The principle can be extended to multi-parental populations where favorable alleles come from more than two parents (Peleman and van der Voort 2003). A typical MARS scheme is illustrated in Fig. 1. Under the GCP - Cassava Challenge Initiative, African breeding programs have initiated MARS for drought tolerance breeding in cassava. SSR and SNP markers are used to identify QTLs and then to identify important allele combinations through three cycles of selection, which is only then followed by phenotyping.

Fig. 1 MARS scheme (adapted from Integrated Breeding Platform – Generation Challenge Programme)
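The genotypic selection step of MARS can be sketched as a simple marker index: each candidate is scored by summing, over the selected QTL-linked markers, the estimated effect multiplied by the number of favorable alleles carried (0, 1 or 2), and the top-scoring individuals are intercrossed for the next cycle. The Python example below is an illustration under these assumptions only. The marker names, effect sizes and function names are hypothetical, and the index used in the GCP cassava programs may differ.

```python
# Illustrative sketch only: marker names, effects and candidates are hypothetical.

def marker_score(genotype: dict, effects: dict) -> float:
    """Sum of estimated QTL effects weighted by favorable allele counts (0-2)."""
    return sum(effects[marker] * genotype[marker] for marker in effects)


def select_for_crossing(candidates: dict, effects: dict, n_select: int) -> list:
    """Rank candidates by marker score and keep the best for recombination."""
    ranked = sorted(
        candidates,
        key=lambda name: marker_score(candidates[name], effects),
        reverse=True,
    )
    return ranked[:n_select]


# Estimated effects from a QTL mapping step (hypothetical values).
effects = {"q1": 2.0, "q2": 1.0, "q3": 0.5}
candidates = {
    "A": {"q1": 2, "q2": 0, "q3": 1},
    "B": {"q1": 1, "q2": 2, "q3": 2},
    "C": {"q1": 0, "q2": 1, "q3": 0},
}
print(select_for_crossing(candidates, effects, n_select=2))
```

Because the score depends only on genotype, this step can be repeated over several crossing cycles between phenotyping trials.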

Genomic Selection for Complex Traits

A proposed alternative approach to dealing with multiple loci conferring small effects is referred to as ‘genomic selection’ (GS) (Meuwissen et al. 2001). This approach is facilitated by new high-throughput genotyping and novel statistical methods. Unlike traditional MAS, which relies on knowledge of individual loci associated with a specific trait, GS uses all marker data as predictors of performance, thus enabling selection for multiple loci of small genetic effect (Jannink et al. 2010). Essentially, a training population is extensively genotyped (possibly using GBS) to give full genome coverage and phenotyped to create models that calculate genomic estimates of breeding values (GEBVs). GEBVs can then be calculated from genotype data alone and used as the criterion for selecting candidate parents within a breeding population, without the need for phenotypic evaluation of the candidates. Simulation and empirical studies indicate that GS can lead to a considerable increase in the rate of genetic gain while dramatically reducing the need for phenotypic evaluation. Benefits of genomic selection are being experienced in the dairy cattle industry (Hayes et al. 2011). In addition, although GEBVs do not show the effects of underlying genes, simulation studies indicate that they are remarkably accurate (Habier et al. 2007; Zhong et al. 2009). This new breeding approach has been comprehensively reviewed (Heffner et al. 2009; Jannink et al. 2010) and should provide significant benefits in breeding for some highly quantitative traits in cassava.
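To make the idea of a GEBV concrete, the following Python sketch fits all marker effects at once by ridge regression, one of several statistical models used for genomic selection, and then predicts breeding values for unphenotyped candidates from their genotypes alone. It is a toy illustration, not a recommended implementation: the function names, the marker data, the phenotypes and the shrinkage parameter `lam` are all assumptions made for this example, and real analyses use thousands of markers and dedicated mixed-model software.

```python
# Illustrative sketch only: data, function names and the value of lam are hypothetical.

def solve(a: list, b: list) -> list:
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_marker_effects(genos: list, phenos: list, lam: float):
    """Ridge estimate of marker effects: (Z'Z + lam*I) u = Z'(y - mean(y)),
    where Z holds centered allele counts (0, 1 or 2) for every marker."""
    n, p = len(genos), len(genos[0])
    col_means = [sum(g[j] for g in genos) / n for j in range(p)]
    y_mean = sum(phenos) / n
    z = [[g[j] - col_means[j] for j in range(p)] for g in genos]
    yc = [y - y_mean for y in phenos]
    ztz = [[sum(z[i][a] * z[i][b] for i in range(n)) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    zty = [sum(z[i][a] * yc[i] for i in range(n)) for a in range(p)]
    return solve(ztz, zty), col_means, y_mean


def gebv(geno: list, effects: list, col_means: list, y_mean: float) -> float:
    """Genomic estimated breeding value of one candidate from genotype alone."""
    return y_mean + sum((x - mu) * u for x, mu, u in zip(geno, col_means, effects))


# Training population: genotyped and phenotyped (toy data, three markers).
train_genos = [[0, 1, 2], [1, 0, 1], [2, 2, 0], [0, 2, 1], [2, 0, 2], [1, 1, 0]]
train_phenos = [8.0, 11.0, 14.0, 9.0, 12.0, 12.0]
effects, col_means, y_mean = fit_marker_effects(train_genos, train_phenos, lam=0.1)

# Selection candidates: genotyped only, ranked by GEBV.
candidates = {"C1": [2, 1, 0], "C2": [0, 1, 2], "C3": [1, 1, 1]}
ranking = sorted(candidates,
                 key=lambda name: gebv(candidates[name], effects, col_means, y_mean),
                 reverse=True)
print(ranking)
```

The shrinkage parameter keeps the system solvable when markers outnumber phenotyped individuals, which is the usual situation in genomic selection. In practice it is estimated from the data rather than fixed in advance.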

Conclusions

Previously, molecular markers, predominantly SSR markers, have been used in cassava to understand genetic diversity and differentiation in populations, to map QTLs associated with large effect genes and to mine genes from wild species. In the past, marker numbers have been a limitation. The advent of relatively low-cost, massively parallel, high-throughput genome sequencing has made high-density SNP discovery feasible. Simultaneously, relatively low-cost, flexible, low to medium density SNP genotyping technologies have been developed. Together, these developments have heralded a new era for the application of molecular markers to plant breeding. It is envisaged that these genotyping platforms, at least in the near to medium term, will serve the cassava molecular breeding community for short-term breeding applications such as MAS and MARS. As GBS and whole genome re-sequencing become more available, they are likely to enable the application of genome-wide marker selection, which is particularly useful for more complex traits. With the availability of the cassava genome sequence, the cassava community is poised to take advantage of these new tools for rapid progress and genetic gain. As described above, we envisage the application of these tools in many different ways, including the development of high-density maps and fine mapping, association mapping, exploration of the genome sequence for gene discovery, transcript profiling, inbred line development and the prediction of heterosis, gene mining in wild species and introgressions, short-term breeding applications such as MAS and MARS, and genome-wide selection approaches such as genomic selection. Some of these applications are already underway in cassava.