1 Association Genetics

Many genetic mapping studies in plants have been conducted with recombinant inbred line (RIL) populations from a biparental cross because it is easy to maintain these populations for replicated trials (Bernardo 2008; Holland 2007). In contrast, association genetics has been implemented extensively in human genetics studies, partly because of the early adoption of large-scale genotyping strategies and the necessity of exploiting population-based samples for studying complex human diseases. However, widespread use of single nucleotide polymorphism (SNP) markers and the reduced cost of sequencing and genotyping have led researchers working with different plant species to adopt association mapping and the underlying linkage disequilibrium (LD) approach (Zhu et al. 2008).

Here we briefly introduce the concept of linkage analysis and association mapping. Readers should refer to other detailed reviews for a full explanation (Flint-Garcia et al. 2003; Nordborg and Tavare 2002; Risch and Merikangas 1996; Zhu et al. 2008). In essence, both linkage analysis and association mapping strategies are designed to identify marker–trait association signals that result from co-inheritance of functional polymorphisms and neighboring DNA variants (markers). In linkage analysis in plants, the signals are typically generated by co-inheritance within a segregating population. This segregating population starts with the cross of two homozygous inbred parents and contains one or more generations of recombination. Association mapping aims to detect marker–trait association signals within a broad collection of accessions: natural populations, landraces, breeding lines, or a combination of these. The reasoning behind this approach is that the historical recombination and genetic diversity captured in such a collection allow fine mapping resolution, because markers that are not tightly linked to the true functional polymorphisms would not generate strong signals (Risch and Merikangas 1996; Zhu et al. 2008).

2 Association Genetics in Plants

Several recent review and perspective papers have documented the current status of association mapping and pointed out challenges that need to be addressed in plants (Myles et al. 2009; Nordborg and Weigel 2008; Zhu et al. 2008). Nevertheless, because genome sequence projects have been completed for the model plant species Arabidopsis (The Arabidopsis Genome Initiative 2000) and several major crops, including rice (Goff et al. 2002; Yu et al. 2002), sorghum (Paterson et al. 2009), maize (Schnable et al. 2009), and soybean (Schmutz et al. 2010), and because genomes of many other plants are being sequenced, association genetics, both as a general strategy in complex trait dissection and as a complementary approach to other existing tools, is expected to attract further attention.

A recent review documented association genetics studies in plant species such as maize, Arabidopsis, sorghum, wheat, barley, potato, rice, loblolly pine, sugarcane, eucalyptus, and perennial ryegrass (Zhu et al. 2008). Association mapping panels in different crops have been established as community resources, and findings from these studies are promising. Regardless of the degree of LD in mapping panels of different crops, which ranges from rapid decay within thousands of base pairs in diverse maize lines (Yu and Buckler 2006) to slow decay over centimorgans in wheat breeding lines (Sorrells and Yu 2009), association genetics has been embraced as a powerful addition to the genetic analysis toolbox.

2.1 Association Mapping Procedures

General components of association mapping include germplasm, genotyping, phenotyping, and analysis (Fig. 9.1) (Zhu et al. 2008). Unlike linkage mapping, association mapping usually involves assembling a collection of ready-to-measure accessions or lines (often called a population, although not a segregating population) rather than developing an F2, BC, or RIL population. A newer strategy that combines features of association mapping with diverse lines and linkage analysis with segregating populations will be introduced in Sect. 2.4.

Fig. 9.1 General procedures of association mapping and highlights of different steps

Genotyping in association mapping also represents a significant departure from traditional linkage analysis. The marker density required for a robust analysis is generally much higher for association mapping than for linkage analysis, even though low-density genotyping with random markers across an association mapping panel mimics the genotyping process of traditional linkage analysis with an F2, BC, or RIL population. For association mapping, such low-density genotyping is primarily geared toward assessing population structure and genetic relatedness of a collection by examining the marker information collectively (Yu et al. 2009), not toward testing these markers individually for marker–trait association unless a very large number of markers is used (Zhu et al. 2008). As we explain in the next two sections, markers to be tested for marker–trait association could be from candidate genes, regions implicated in previous linkage mapping studies, or a large number of markers across the whole genome.

A well-designed association panel will have extensive phenotypic diversity. Phenotyping of such diverse materials is challenging given the broad variation in photoperiod sensitivity, flowering time, and market type and other existing differentiation within the collection (Myles et al. 2009; Zhu et al. 2008). As a result, field design, appropriate blocking, timing of record taking, data analysis, and interpretation of results all demand more effort. Data analysis for association mapping involves (1) marker data analysis such as population structure (Q), relative kinship (K), principal component analysis (P), or multidimensional scaling analysis (M); (2) trait data processing such as multiple environment data analysis; (3) model testing for appropriate models (i.e., Q, P, M, K, QK, PK, and MK); and (4) marker–trait association testing. Readers should refer to recent research and review articles for detailed information on algorithms and software packages that are commonly used in plants (Bradbury et al. 2007; Yu et al. 2006; Zhu et al. 2008).
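As a small illustration of the marker data analysis in step (1), the Python sketch below computes a pairwise identity-by-state similarity matrix from background markers scored on inbred lines. It is a minimal example under stated assumptions and is not the relative kinship estimator implemented in the software cited above: alleles are assumed to be coded 0 or 1, missing calls are assumed to be `None`, and the function name `ibs_similarity` is hypothetical.

```python
def ibs_similarity(genotypes):
    """Pairwise identity-by-state similarity among inbred lines.

    genotypes: list of equal-length lists, one per line; alleles coded 0 or 1,
    None for a missing call (an assumed coding, not taken from the chapter).
    Returns an n x n matrix holding the fraction of markers, among those
    called in both lines, at which the two lines carry the same allele.
    """
    n = len(genotypes)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = 0
            compared = 0
            for a, b in zip(genotypes[i], genotypes[j]):
                if a is None or b is None:
                    continue  # skip markers missing in either line
                compared += 1
                if a == b:
                    shared += 1
            value = shared / compared if compared else float("nan")
            sim[i][j] = sim[j][i] = value
    return sim


# Toy panel: three lines, four background markers
panel = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, None],
]
similarity = ibs_similarity(panel)
print(similarity[0][1])  # 0.75
```

A matrix of this kind only conveys the idea of summarizing background markers collectively; published analyses typically rescale or otherwise adjust such similarities before using them in a mixed model.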

2.2 Candidate Gene Association Mapping

In association mapping, candidate genes or regions can be targeted on the basis of metabolic and biochemical pathways, mutational studies, linkage analysis results, and genome sequence annotations from either the species of interest or relevant models (Zhu et al. 2008). This is a trait-specific, hypothesis-driven approach. As we stressed in Sect. 2.1, an adequate number of background markers need to be genotyped and analyzed for population structure and relative kinship to ensure that tests of candidate gene SNPs are valid. Recent examples of candidate gene association mapping include carotenoids in maize (Harjes et al. 2008; Yan et al. 2010) and eating and cooking properties in rice (Tian et al. 2009). In these studies, well-characterized pathways provided excellent starting points for candidate gene selection.

2.3 Genome-Wide Association Study

Association mapping can be conducted by genotyping all individuals with tens of thousands of SNPs instead of focusing on candidate genes or regions. Genome-wide association studies (GWAS) have been conducted extensively for many years to dissect the genetic causes of complex human diseases (Manolio et al. 2009; Wang et al. 2005). In plants, however, GWAS at a substantial scale have been completed only in the model species Arabidopsis (Atwell et al. 2010; Zhao et al. 2007). GWAS represent an important advance from candidate gene studies or family-based linkage studies. Large-scale GWAS typically validate findings of previously identified genes and generate new signals and hypotheses for further investigation. However, results from GWAS have also raised some concerns about a potential limitation of association mapping, termed “missing heritability.” The classic example of missing heritability is the mapping of human height (Manolio et al. 2009). Forty loci have been implicated in controlling adult height variation, but together they explain only 5% of phenotypic variation even though the estimated heritability of this trait is about 80% (Visscher 2008). Potential causes of this problem include rare alleles, epistasis, limited sample size, structural variants, and the interaction between genotype and environment. Strategies and methods in GWAS are evolving to address these concerns. It is still too early to know whether GWAS in plants will be subject to the same concerns.
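One rough way to read the height example, assuming that the 5% and the 80% are both expressed as fractions of total phenotypic variance, is to ask what share of the heritable variation the detected loci account for:

```latex
\[
\frac{\text{variance explained by detected loci}}{\text{heritable variance}}
\approx \frac{0.05}{0.80} \approx 0.06
\]
```

Under that assumption, roughly 94% of the heritable variation in height remained unexplained by the 40 loci.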

2.4 Nested Association Mapping

Nested association mapping (NAM) is a special case of joint linkage and linkage disequilibrium mapping and is well suited for many plant species (Yu et al. 2008). The essence of NAM is to combine the merits of linkage analysis with designed populations and association mapping with assembled germplasm. First, a set of diverse inbred lines is selected as founders and crossed according to genetic designs (e.g., Reference Design, Design I, Design II, Diallel, Single Round Robin, or Double Round Robin). Then, RIL are developed from each cross. Genotyping the founders and RIL with a smaller set of tagging markers makes it possible to track the recombination of chromosome segments. Further, genotyping the founders with a much larger set of markers permits this marker information to be projected onto the tagged chromosome segments of the RIL. Finally, the projected marker data are combined with the phenotype data of the RIL for high-resolution mapping (Yu et al. 2008). NAM is considered a major tool for next-generation genetics (Nordborg and Weigel 2008). Multiple-family analysis of maize NAM populations provided tremendous power and precision in revealing the multigene nature of flowering time (Buckler et al. 2009; McMullen et al. 2009).
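The projection step can be sketched in Python as follows. This is a deliberately simplified illustration, not the procedure used in the cited NAM studies: it assumes one chromosome, one RIL, parent-of-origin calls coded `"F"` (diverse founder) or `"C"` (common parent) at each tagging marker, and it leaves a SNP unassigned (`None`) whenever the flanking tagging markers disagree, because a crossover lies somewhere in that interval. The function and argument names are hypothetical.

```python
from bisect import bisect_right


def project_founder_snps(tag_positions, tag_origin, snp_positions,
                         founder_alleles, common_alleles):
    """Project parental SNP alleles onto one RIL for one chromosome.

    tag_positions: sorted positions of the tagging markers
    tag_origin: "F" or "C" at each tagging marker for this RIL
    snp_positions: positions of SNPs genotyped only on the parents
    founder_alleles, common_alleles: parental alleles at those SNPs
    """
    projected = []
    for pos, f_allele, c_allele in zip(snp_positions, founder_alleles,
                                       common_alleles):
        right = bisect_right(tag_positions, pos)  # first tag to the right
        left = right - 1                          # nearest tag at or left of pos
        flanks = set()
        if left >= 0:
            flanks.add(tag_origin[left])
        if right < len(tag_positions):
            flanks.add(tag_origin[right])
        if len(flanks) != 1:
            projected.append(None)  # crossover in interval, or no tags at all
        else:
            origin = flanks.pop()
            projected.append(f_allele if origin == "F" else c_allele)
    return projected


# Toy RIL: founder segment on the left, common-parent segment on the right
calls = project_founder_snps(
    tag_positions=[100, 200, 300, 400],
    tag_origin=["F", "F", "C", "C"],
    snp_positions=[50, 150, 250, 350, 450],
    founder_alleles=["A", "G", "T", "C", "A"],
    common_alleles=["T", "A", "C", "G", "G"],
)
print(calls)  # ['A', 'G', None, 'G', 'G']
```

A real analysis would more likely assign a probability or weighted value within a crossover interval rather than discard the SNP; the hard cutoff here is chosen only to keep the example short.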

3 Resources and Examples in Sorghum Association Genetics

Over the past decade, several groups have created resources for association mapping in sorghum. These resources include carefully selected, diverse germplasm collections characterized for population structure on the basis of variation at genome-wide molecular markers as well as experimental mapping populations for joint linkage and linkage disequilibrium studies. Thanks to collaborative efforts among some of these groups, very valuable germplasm and marker resources are or will soon be publicly available for sorghum, opening the door to the integrated study of sorghum and maize and allowing incorporation of new genetic resources into sorghum breeding programs.

3.1 Linkage Disequilibrium in Sorghum

The extent of linkage disequilibrium is a key factor in the design and implementation of association genetics strategies. Given sorghum’s lower estimates of sequence variation, which implies a smaller effective population size, as well as its predominately self-pollinating mating system, LD in sorghum was expected to be more extensive than in maize (Hamblin et al. 2004). This expectation was confirmed in several studies that used resequencing data from diverse sorghum lines and showed strong LD between sequence polymorphisms (i.e., SNPs) within gene-sized regions. An early study examined LD within six unlinked regions ranging in size from 40 to 100 kb and estimated the population recombination parameter, $4N_e r$, also called ρ (Hamblin et al. 2005). This parameter is useful because it summarizes LD across an entire region and has an expected relationship with the important parameter $r^2$: $E[r^2] = 1/(1 + 4N_e r)$. Estimates of ρ corresponded to expected $r^2$ values ranging from 0.14 to 0.71 for loci 10 kb apart with an average expected $r^2$ value of 0.25. At a distance of 1 kb, the average expected $r^2$ value was close to 0.8.
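The relationship between ρ and expected r² can be checked with a few lines of Python. The sketch below assumes that ρ scales linearly with physical distance within a region, so that ρ for a pair of sites is a per-kb value multiplied by their separation in kb; the function names are hypothetical. Because it starts from the average expected r² rather than from the region-by-region estimates, it is only a rough consistency check of the reported values.

```python
def expected_r2(rho_per_kb, distance_kb):
    """Expected r^2 between two sites: E[r^2] = 1 / (1 + rho)."""
    return 1.0 / (1.0 + rho_per_kb * distance_kb)


def rho_per_kb_from_r2(r2, distance_kb):
    """Invert E[r^2] = 1 / (1 + rho) to recover rho per kb."""
    return (1.0 / r2 - 1.0) / distance_kb


# An expected r^2 of 0.25 at 10 kb implies rho of about 0.3 per kb
rho = rho_per_kb_from_r2(0.25, 10.0)

# The same rho gives an expected r^2 near 0.8 at 1 kb
print(round(expected_r2(rho, 1.0), 2))  # 0.77
```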

Although expected values provide useful information, it is critical to realize that there is tremendous variation in LD patterns across the genome. This is especially true in sorghum, a species that has recently experienced demographic events (e.g., domestication) that have dramatically perturbed its patterns of variation (Hamblin et al. 2006). Furthermore, the domestication bottleneck has had the effect of creating perfect LD (i.e., $r^2 = 1$) for sets of SNPs that have different mutational histories. For sets of SNPs that are closely linked, there has not been sufficient recombination to break up those haplotypes (Fig. 9.2).

Fig. 9.2 Short-range linkage disequilibrium ($r^2$) as a function of distance. Data were pooled from six unlinked regions (Hamblin et al. 2005)
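For readers unfamiliar with the statistic plotted here, the following Python sketch computes r² for one pair of biallelic SNPs scored on inbred lines, where each line contributes a single haplotype. It is a minimal illustration under assumed conventions (alleles coded 0 or 1, `None` for missing data, hypothetical function name `r_squared`) and is not the code used to produce the figure.

```python
def r_squared(snp_a, snp_b):
    """r^2 between two biallelic SNPs scored on the same inbred lines.

    snp_a, snp_b: equal-length lists with alleles coded 0 or 1 and None for
    missing calls (assumed coding). Returns NaN if either SNP is
    monomorphic among the lines called at both sites.
    """
    pairs = [(a, b) for a, b in zip(snp_a, snp_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return float("nan")
    p_a = sum(a for a, _ in pairs) / n          # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in pairs) / n          # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n  # 1-1 haplotype
    d = p_ab - p_a * p_b                        # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    return d * d / denom


print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 (perfect LD)
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 (no association)
```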

In another study, 15 genes in the starch metabolism pathway were sequenced in 23 lines, mostly diverse cultivars (Hamblin et al. 2007). Within 11 genes that each spanned up to 12 kb, more than 40% of SNP pairs were significantly associated at the 0.05 significance level. Haplotype structure was strong in most genes, and recombination was evident in only five genes. However, LD patterns varied widely across these regions (Fig. 9.3). In a similar study conducted to survey variation in six genes in a much larger number of lines (N = 129–184), little evidence of recombination was found (de Alencar Figueiredo et al. 2008).

Fig. 9.3 Patterns of linkage disequilibrium vary for different genome regions. (a) Starch synthase III (Sb07g005400), 10 kb span; (b) debranching enzyme (Sb06g001540), 12 kb span; and (c) glucose phosphate transferase (Sb07g005200), 3 kb span. Figures were made in Haploview (Barrett et al. 2005). Color indicates the value of $r^2$ (white = low, black = high)

The euchromatic regions in sorghum, which account for 97% of the genetic map, total about 252 Mb in length (Paterson et al. 2009). These LD studies suggest that a marker density for GWAS in diverse sorghum of 1 per kb in euchromatic regions, or about 250–300k SNPs, should be adequate. On the other hand, resequencing studies have revealed that there are regions in the genome where no common SNPs occur over distances of several thousand basepairs or greater. In these regions, which either have experienced selection or simply contain low variation because of genetic drift, we will have limited ability to identify genetic markers and subsequent marker–trait association signals. This is also true for centromeric regions.
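The marker count quoted above follows from simple arithmetic, shown here as a short Python sketch. It assumes evenly spaced markers, which real SNP sets will not achieve, and the function name `markers_needed` is hypothetical.

```python
def markers_needed(region_bp, spacing_bp):
    """Number of evenly spaced markers needed to cover region_bp
    at one marker per spacing_bp (rounded up)."""
    return -(-region_bp // spacing_bp)  # ceiling division on integers


euchromatin_bp = 252_000_000  # about 252 Mb of euchromatin, as stated above
print(markers_needed(euchromatin_bp, 1_000))  # 252000
```

The result sits at the lower end of the 250–300k range; uneven SNP spacing and failed assays are plausible reasons to budget toward the upper end.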

3.2 Sorghum Diversity Panels

In recent years, several papers have reported the assembly and characterization of diverse sorghum germplasm collections designed explicitly for use in association genetic studies. The Centre de Coopération Internationale en Recherche Agronomique pour le Développement (CIRAD) assembled a core collection of 210 landraces that are representative of race, latitude of origin, response to day length, and production system and characterized them with restriction fragment length polymorphism (RFLP) probes (Deu et al. 2006). These lines represent a subset of a larger collection that was developed by the Generation Challenge Program (GCP).

A collaboration of US institutions assembled a collection of 377 accessions that represent species-wide diversity for panicle architecture and other morphological features (Casa et al. 2008). To facilitate phenotypic characterization in temperate regions, this collection was purposely assembled with accessions from the sorghum conversion program and lines of historical importance in sorghum breeding. The whole collection was characterized with 47 simple sequence repeat (SSR) markers. Although population structure is specific to the particular composition of a population, the general patterns of population structure identified in these two studies were similar. As expected, the genetic clusters correspond to the geographic and racial groupings identified in many diversity studies.

While population structure in sorghum is not strong in comparison with other self-pollinating crops like barley and rice, it is sufficient to generate modest levels of long-range LD due to admixture in panels of diverse lines, which can lead to spurious associations with phenotypes. Preliminary analysis of the US sorghum diversity panel indicated that current mixed-model methodology, accounting for both population structure and relative kinship, can adequately control for the level of population structure present in these panels (Casa et al. 2008).

A GCP-funded consortium studying genetic factors underlying drought and aluminum tolerance in sorghum is using a combined panel that includes most of the lines from the CIRAD and US panels. This consortium, led by researchers at the United States Department of Agriculture (USDA) and the Empresa Brasileira de Pesquisa Agropecuária (EMBRAPA) (Brazilian Enterprise for Agricultural Research), is also developing markers to carry out a GWAS (see Sect. 3.5). Initial genotyping with a low density of SNP markers is sufficient to find only a tiny fraction of QTL; however, LD studies with these marker sets are beginning to provide a detailed view of LD in this set of germplasm, revealing how many more markers will be necessary for whole genome coverage. This higher coverage will likely be obtained through genotyping-by-sequencing technology rather than array-based SNP genotyping technology.

The combined CIRAD-US panel of 480 lines will be made publicly available through the Germplasm Resources Information Network (GRIN), the USDA germplasm system (the US panel is already available as the Sorghum Association Panel). Marker data for each line will also be made publicly available, providing a resource for further association studies of the wide phenotypic variation captured in this collection.

A research group in Japan has also assembled a sorghum collection from 3,500 sorghum lines preserved at Genebank, National Institute of Agrobiological Sciences (NIAS), Japan. These lines are primarily from Asian and African sources. From an initial set of 320 lines selected on the basis of geographic distribution, 107 were chosen on the basis of diversity at 38 SSR markers (Shehzad et al. 2009b). Because this core collection is drawn from a comparatively small germplasm collection (the US and CIRAD panels are both drawn from collections of more than 36,000 accessions) and the final size is also smaller than the US and CIRAD panels, the reduced diversity level in this NIAS panel is not unexpected. Structure analysis suggested that the NIAS lines came from three subpopulations, whereas the CIRAD and US populations appear to form nine or ten clusters. Although the NIAS population is quite small and LD is not especially extensive, QTL for 12 morphological traits were detected (Shehzad et al. 2009a).

Finally, the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT) has developed a sorghum mini core collection of 242 accessions that can potentially be used for association mapping (Upadhyaya et al. 2009). This mini core collection was selected from a core collection of 2,247 accessions with phenotypic data measured for 11 qualitative traits and 10 quantitative traits. Genotyping the mini core collection and the core collection and extensively phenotyping the mini core collection would be the next steps toward using the diversity captured in these collections and inferring marker–trait associations.

3.3 Sweet Sorghum Diversity Panel

There is an emerging emphasis on using sorghum as a dedicated bioenergy crop (Carpita and McCann 2008; Rooney et al. 2007), and two sweet sorghum diversity panels have been studied (Murray et al. 2009; Wang et al. 2009). In the first study, a panel of 125 sorghum accessions was genotyped with 47 SSRs and 322 SNPs and phenotyped for brix and plant height (Murray et al. 2009). Population structure analysis indicated this panel contains three major groups of sorghum accessions: historical and modern syrup, modern sugar/energy types, and amber types. In the second study, 96 sweet sorghum accessions, an initial sample from the US historic sweet sorghum collection, were genotyped with 95 SSRs and phenotyped for flowering time, plant height, and brix (Wang et al. 2009). Although molecular marker analyses revealed weak population differentiation among these 96 accessions, the combined assessment and model testing of these three phenotypes demonstrated that this sweet sorghum panel can be classified as a type II association sample with a low level of relatedness (Zhu and Yu 2009). Genotyping experiments that include additional accessions from the US historic sweet sorghum collection are currently being conducted to expand these efforts.

3.4 Sorghum NAM Panel

On the basis of results from population structure analysis of 377 sorghum accessions using 47 SSR markers (Casa et al. 2008) and breeders’ knowledge about these accessions, 10 diverse founders (SC283, SC1103, Segaolane, Macia, SC35, Ajabsido, SC971, SC265, SC1345, and P898012) were chosen from different subpopulations and crossed to the common parent, RTx430, to create the sorghum NAM population (Table 9.1). From each cross, 200 RIL were derived to form a sorghum NAM panel with 2,000 RIL (Fig. 9.4). A complementary set of RIL was planned to be derived from crosses of the common parent Tx623 with ten different diverse founders (WL Rooney, personal communication). Each of these founder lines represents a subgroup identified for the sorghum diversity panel (Casa et al. 2008).

Table 9.1 Diverse founders were chosen from different subpopulations and crossed to a common parent, RTx430, to create the sorghum NAM population

Fig. 9.4 Schematic diagram of sorghum NAM population development. The genome of each founder is color coded to show that the genome of RIL is a mosaic of founder genome segments. High-density genotyping of founders permits the linkage disequilibrium information captured in these diverse founders to be exploited for high-resolution mapping

RTx430 (Miller 1984) has been widely used as a pollinator parent to produce sorghum hybrids in the USA; it is amenable to genetic engineering through both microprojectile bombardment and Agrobacterium approaches. Segaolane is a drought-tolerant kafir-type sorghum from Southern Africa (Gowda et al. 2009). Macia is a food-grade sorghum cultivar developed and selected in Tanzania for its early maturity characteristics and excellent taste attributes. Macia is high yielding and possesses preferred traits such as cooking quality and malt production characteristics for use in brewing (Bucheyeki et al. 2010). SC35 is drought resistant and has been used as a staygreen trait donor in sorghum breeding programs in the USA and Australia. Ajabsido is from Sudan and possesses excellent pre-flowering drought tolerance (Gowda et al. 2009). P898012 is well adapted to production environments in Niger and Sudan; it has both pre-flowering and post-flowering drought resistance and is amenable to transformation by both microprojectile bombardment (Casas et al. 1993) and Agrobacterium (Zhao et al. 2000). SC283 is a conspicuum-type sorghum from Tanzania that expresses excellent tolerance to acid soils and aluminum toxicity (Bernai and Clark 1998). The rest of the NAM parents (SC1103, SC971, SC265, and SC1345) are from the sorghum conversion program but have not been well documented in the literature.

3.5 SNP Genotyping Array

Using SNPs discovered in Sanger resequencing studies of more than 300 loci in samples of 16–30 sorghum accessions, Hamblin (unpublished) designed 384-SNP genotyping assays using the Illumina GoldenGate platform. These SNPs represented about 220 loci including several candidate genes each with several SNPs. More than 80% of the assays were successful. These data have been used in candidate gene-based association studies of stem sugar (Murray et al. 2009) and endosperm carotenoid content (Salas-Fernandez, unpublished) and in a study of population structure (Brown and Myles, unpublished).

The GCP consortium and the Sorghum Translational Genomics Program at Kansas State University worked collaboratively to discover additional SNPs and develop a genotyping platform with higher density. Through this effort, Solexa sequencing of reduced representation libraries from 14 sorghum accessions, including the sorghum NAM parents, was used to discover about 34,000 high-quality, non-singleton SNPs. Discovery of SNPs in sorghum is less problematic than in maize because of the much lower level of gene duplication; most SNPs called in this analysis aligned to unique locations in the genome.

A genotyping array with 1,536 SNPs was designed to achieve maximal genome coverage (Fig. 9.5). Aside from the centromeric regions, which are very poorly represented in the SNP data, the average distance between SNPs is about 400 kb. The 480-line CIRAD-US panel has been genotyped with these 1,536 markers. Much of the sorghum genome will need a much higher density of markers if we are to detect genes of modest effect underlying complex traits. This will likely be accomplished by genotyping-by-sequencing, which is quickly becoming much more cost-effective than SNP genotyping platforms such as the GoldenGate assay.

Fig. 9.5 Distribution of 1,536 SNPs across ten sorghum chromosomes

3.6 Examples of Sorghum Association Mapping

Using the US sorghum diversity panel, Brown et al. (2008) examined the dwarfing gene Dw3 for its association with reduced lower internode length and elongated apex. Fine mapping of an additional dwarfing QTL, which showed an epistatic effect with Dw3, successfully narrowed the region to approximately 100 kb. In another recent study, several genomic regions associated with brix and plant height were identified (Murray et al. 2009). However, the marker density in that study (47 SSRs and 322 SNPs) was still low. Further genotyping and analysis would provide additional evidence for the detected signals.

4 Opportunities and Challenges

Essential components for carrying out large-scale association mapping studies in sorghum are in place. First, several diversity panels have been established and characterized with low-density background markers. Second, various research groups have resequenced additional sorghum accessions for SNP discovery. There is no foreseeable obstacle to obtaining hundreds of thousands of SNPs for genome-wide scans for multiple traits. Genotyping arrays with different marker densities have been developed, and the density is expected to increase. In addition, genotyping-by-sequencing may soon become practical for these sorghum diversity panels. Third, our understanding of association mapping panels and analysis methods has significantly increased because of earlier empirical studies in LD and association mapping.

Phenotyping, however, remains a major challenge, especially for agronomically important traits. Obtaining robust phenotypes (e.g., abiotic and biotic stresses) for a large number of accessions requires multiple environmental trials, long-term commitment, and stable funding for a concerted research consortium. Association mapping is multidisciplinary in nature and could be difficult to implement in small research programs; many aspects of this approach deserve further attention.

Fortunately, preliminary efforts have been made to address the adequacy of background markers for estimating population structure and relative kinship (Yu et al. 2009), variation explained with mixed-model association mapping (Sun et al. 2010), and computational efficiency in large-scale, genome-wide studies (Zhang et al. 2010). As we move toward GWAS, the question of missing heritability may emerge. Quantitative genetics has played a critical role in developing plant and animal breeding methods and provides a natural framework for dissecting complex traits with high-throughput technologies. Modifying and adapting classic quantitative genetics and population genetics models and combining genomic technologies with genetic designs and experimental designs will help us account for the missing heritability. Ultimately, association genetics is an additional strategy that needs to be combined with existing and emerging strategies to realize the full potential of ultrahigh-throughput genomic technologies in crop improvement (Yu 2009).