1 Introduction

Crop plants evolved from their wild ancestors by the processes of domestication and selective breeding over the last ca. 10,000 years. Initially, wild plants carrying promising traits were cultivated, leading eventually to locally adapted landraces. These lost many undesirable alleles as useful alleles became enriched (Feuillet et al. 2008). Modern breeding has largely extended this by a process of crossing the ‘best with the best’ and the successes have been impressive. Unfortunately, there are indications that we are approaching a performance ceiling for at least some crops, as the best alleles become assembled in elite genetic materials (Tanksley and McCouch 1997, http://www.fao.org/ag/agp/agpc/doc/riceinfo/Asia/ASIABODY.HTM). The potential to re-invigorate these elite materials may be provided by the introduction of new alleles from wild species and old, locally adapted germplasm. Many studies have demonstrated the value of alleles originating from un-adapted and unimproved germplasm showing that centuries of selective breeding have not necessarily resulted in the accumulation of all the optimal alleles. For example, several barley cultivars have been released in Europe that contain fungal resistance genes introgressed recently from H. spontaneum (von Korff et al. 2005; Schmalenbach et al. 2009). A major challenge for the future is to streamline this process using high throughput genomics approaches.

The identification and recruitment of useful alleles are two very different tasks and both are difficult. Allele identification requires detailed and careful phenotypic trait analysis, combined with high-resolution genomic characterisation. Comparison between the phenotypic and genotypic data sets, either by linkage mapping in bi-parental populations or by genome wide association scanning (GWAS) of panels of related genotypes, can in principle yield candidate marker alleles linked to the traits investigated. While the former approach has been generally successful in identification, deployment of the results in breeding has not been as widespread for many reasons, including the problems in identifying markers sufficiently closely linked for effective use in selection. The latter approach is therefore becoming more attractive because it is intrinsically higher resolution and has, at least potentially, more power because it scrutinises the results of many more generations of recombination and selection (Caldwell et al. 2006; Rostoks et al. 2006; Cockram et al. 2010). However, there are also issues with GWAS that need to be resolved before it can be applied most effectively. In this chapter we review some of the challenges that we have encountered and that need to be considered when planning to exploit genetic resources for GWAS. These are largely based on our experiences in establishing a successful GWAS programme in barley.

2 Multi Parent Populations

Over the past 25 years the correlation of phenotypic data with genetic markers in the offspring from specific bi-parental crosses using the well-established methods of ‘genetic linkage analysis’ has significantly advanced our understanding of the number, organization, location, and contribution of genetic loci to both simple and complex phenotypes (e.g. Turner et al. 2005; Yan et al. 2006). In a growing number of cases, particularly for Mendelian (i.e. single gene) traits, linkage mapping in very large populations has allowed the responsible genes to be fine mapped and ultimately to be cloned and analysed at the sequence level (Komatsuda et al. 2007). This has been achievable because the large number of recombination events in such populations allows the trait gene to be positioned so accurately that it is often possible to resolve its location to a specific DNA sequence (when available) or a single large-insert DNA clone that contains only one or perhaps a few candidate genes. Successes include major disease resistance and developmental genes such as Mlo, Rpg1, Vrn1 and Ppd1, and more will follow. Bi-parental mapping requires the construction of specific populations that segregate for the trait of interest and, because it samples only a small portion of the genetic variation inherent in the genepool under study, different populations are frequently required for each new trait studied.

More recently, geneticists have started to investigate GWAS in an attempt to increase the resolution of primary genetic studies. In contrast to linkage analysis, association approaches evaluate the correlation between loci and/or markers in populations of plants that share a degree of common history. Populations used for GWAS include collections of related individuals within natural or constructed populations from within a species. Association mapping effectively increases the number of recombination events to include all occurrences within the history of the sample. This presents a distinct advantage over bi-parental populations by improving genetic resolution from the megabase to the kilobase scale. The resolution inherent in a population used for GWAS is largely dependent upon the phenomenon of linkage disequilibrium, a measure that can itself be complicated by the history of the population and which has the potential to increase the frequency of false positive associations.

3 Linkage Disequilibrium

Linkage Disequilibrium (LD) is defined as the non-independence of alleles at different loci in a population (Box 1). At its most basic level, LD is maintained as a balance between mutation and recombination. At the moment of spontaneous (or induced) generation all new mutations are in perfect association with their genetic background. However, over time the processes of recombination (during meiosis) and genetic drift gradually lead to decay in the extent of these original associations and, as new mutations are generated and selected and old ones are lost, new associations are established. LD is therefore the product of evolutionary and biological factors that together contribute to the genetic structure and allelic histories of each gene in the population. The extent of LD can be measured effectively by assaying and correlating the allelic state of genetically linked molecular markers at known genetic loci across the genome in what has been termed an association mapping panel of genotypes. When LD is extensive, statistically significant associations (correlations) may be detected between markers that are several to many centiMorgans (i.e. potentially several megabases) apart. When it is low, associations between genes or markers may rapidly reduce to become non-significant at the sub-centiMorgan scale, or over thousands or even hundreds of bases. In either case, false positive associations can arise from the effects of genetic structure in the population, which may have originated from non-random mating, population bottlenecks or directional selection. As an example, up to 80% of the significant associations detected between polymorphisms in the maize dwarf8 (d8) gene and flowering time were assessed as being due to population substructure (Thornsberry et al. 2001).
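
As an illustration of how the non-independence described above is usually quantified, the following minimal sketch computes three common pairwise LD statistics (D, |D′| and r²) from haplotype counts at two biallelic loci. It assumes fully inbred lines, so that each line contributes one directly observed two-locus haplotype; the function name and the allele labels A/a and B/b are hypothetical and are not taken from the chapter.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, abs_D_prime, r2) for two biallelic loci.

    Arguments are counts of the four two-locus haplotypes observed in a
    panel of inbred lines (illustrative labels, not from the chapter).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at locus 2

    # D: deviation of the observed haplotype frequency from independence
    d = p_AB - p_A * p_B

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")

    # r2: squared correlation between the allelic states at the two loci
    r2 = d * d / denom

    # |D'|: D scaled by the largest magnitude it could take given the
    # allele frequencies
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return d, abs(d) / d_max, r2
```

For example, `ld_stats(50, 0, 0, 50)` describes two loci in complete association (r² of 1), whereas `ld_stats(25, 25, 25, 25)` describes two loci at linkage equilibrium (r² of 0).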

Mating system has a similarly profound impact on LD. Simulation studies have demonstrated that in the absence of mitigating factors, high levels of LD persist to a greater extent in highly selfing species (like barley), and that this is predominantly a function of the effective recombination rate. Inbreeding increases homozygosity, and as a consequence a significant proportion of all recombination events in an inbreeding species will fail to bring about an exchange of genetic variation. This has been countered to some extent by artificial outcrossing, the basis of plant breeding practised over the last hundred years. Therefore, in inbreeding crops like barley, while we would naturally expect LD to be extensive in natural populations, plant breeding has been effective at generating a pseudo-outcrossing population where LD has been reduced to an extent that makes it useful for medium resolution association-based approaches and the identification of correlations between trait genes and alleles at molecular marker loci (Rostoks et al. 2006).
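
The effect of selfing on the effective recombination rate can be made concrete with a standard population-genetic approximation that is not given in the chapter itself: at equilibrium under partial selfing the inbreeding coefficient is F = s / (2 − s), where s is the selfing rate, and recombination is effective only in the fraction (1 − F) of meioses that occur in heterozygotes. The sketch below assumes this approximation; the function names and the selfing rate of 0.99 in the usage note are illustrative assumptions only.

```python
def inbreeding_coefficient(selfing_rate):
    """Equilibrium inbreeding coefficient F = s / (2 - s) under partial selfing."""
    if not 0.0 <= selfing_rate <= 1.0:
        raise ValueError("selfing_rate must lie between 0 and 1")
    return selfing_rate / (2.0 - selfing_rate)


def effective_recombination(r, selfing_rate):
    """Approximate effective recombination rate r_e = r * (1 - F)."""
    return r * (1.0 - inbreeding_coefficient(selfing_rate))
```

With an assumed selfing rate of 0.99, `effective_recombination(0.01, 0.99)` returns roughly 0.0002, about a fifty-fold reduction relative to an outcrosser, which is consistent with the extensive LD expected in natural populations of a selfing species.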

While it is relatively easy to detect marker-trait associations if there is extensive LD, this inevitably results in a lower resolution map that requires more work to pin down the allele associated with the trait under study. Natural populations (including both true wild plants and adapted cultivated landraces) contain high levels of genetic diversity and are a great potential reservoir of DNA variation for crop improvement. Because of their history (i.e. number of generations), they also exhibit less extensive LD (Morrell et al. 2005; Kraakman 2005; Caldwell et al. 2006). These are potentially valuable as populations with low LD provide an opportunity to reveal high-resolution associations. Of course, if a genome wide approach is being adopted, the number of markers needed to find any associations would need to be extremely high, which is an associated cost. This has led to the suggestion that, at least in principle, associations could be mapped to an approximate genomic location in germplasm where LD is extensive, then exact genomic regions could be saturated using progressively wider germplasm with correspondingly lower LD but higher marker densities around the established location of the causal gene. In practice this has not yet been achieved.

4 Population Structure

For association mapping, the underlying population structure can be a strong confounding factor, especially for traits that have driven the geographical or environmental adaptation of the germplasm set. From a practical point of view, considerable care therefore has to be taken in choosing germplasm, avoiding—if possible—the inclusion of strong population stratification given it is a source of false positive associations. In other words, for a specific trait if there were major loci associated with genetically distinct homogeneous clusters of lines, many background markers carrying alleles exclusive to the specific clusters are also going to be associated with the trait, even though they are not causal. Not surprisingly, a number of approaches have been used to minimise these effects.

Statistical Approaches

Our genome-wide association mapping studies in barley (Hordeum vulgare) have forced us to confront the problem of population structure as a confounding factor. Barley germplasm is strongly stratified reflecting crop type (in terms of growth habit and spike morphology) and geographical origin, which is heavily linked to local adaptation of the germplasm (Fig. 10.1). For most studies, genotyping and phenotyping are conducted simultaneously. Thus, the exploration and statistical adjustment for stratification is generally conducted within the running time of a project and there is little scope for choosing a different set of lines if structure turns out to be a considerable problem. Moreover, after expensive and time-consuming data collection, a natural tendency is to want to include as many data points as possible in an analysis. Thus statistical approaches that correct and/or account for the effects of population structure within association scans have guided most of the research on GWAS for the last few years. Several different approaches have been proposed in the literature (Mackay and Powell 2007). Issues arise, however, when the number and identity of markers that remain significant after employing different statistical population structure correction methods are either inconsistent or remove known biological factors correlated at some level with the population stratification. This can result in uncertainty over what QTL to prioritise for further studies or to use as diagnostics in Marker Assisted Selection (MAS).

Fig. 10.1 Population structure in the cultivated elite barley genepool (523 lines with 890 non position-redundant SNPs). Three main clusters are evident based on the major biological divisions within the species

It is worth mentioning that in an association panel the ancestral marker allele frequencies are not known. Therefore, even with saturated genome coverage, it is not possible to build a genetic map de novo using LD and then to use this as a framework for visualizing the location of QTL. Thus, a prior genetic map using one or several bi-parental populations needs to be built in parallel to the association mapping panel to estimate the genetic, or better physical, order of the markers in the genome, unless of course the genome sequence of the target species has been assembled. Some of the main approaches for dealing with structure are:

Structured Association

Structured association uses multiple polymorphisms assayed throughout the genome to compute statistics that capture the underlying population structure of the germplasm, which introduces non-independence between genotypes as a result of common genetic background. These statistics can then be modelled within a Mixed Linear Model (MLM) framework to account for multiple levels of relatedness due to historical population structure and kinship (Yu et al. 2006). Different software and statistical packages, for example R v 2.9.0 (http://www.R-project.org/), TASSEL v.3.0 (http://www.maizegenetics.net) or Genstat 14 (VSN International 2011), provide different ways of correcting for population structure, which can be compared to assess which best suits your data. A variance-covariance matrix containing coefficients of co-ancestry (kinship matrix) can be included in the mixed model to account for genetic relatedness between genotypes. Eigenanalysis uses the scores of the most significant principal components from the molecular marker matrix as co-variables in the mixed model, which is an approximation to the use of a kinship matrix. In barley, we found a mixed linear regression model (Yu et al. 2006), which accounts for multiple levels of relatedness due to historical population substructure and kinship, to perform best, either implemented on its own or in combination with other methodologies. The significance threshold is usually estimated for each analysis using a Bonferroni corrected p-value of 0.05.
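
For orientation, the mixed model referred to above is commonly written in a form like the one below. The notation is an assumption on our part, since the chapter does not define symbols: y is the vector of phenotypes, β holds fixed effects including the marker being tested, Q holds population structure covariates (or principal component scores) with effects v, u is the vector of polygenic background effects whose covariance is proportional to the kinship matrix K, and e is the residual. The scaling of K (here 2K) varies between implementations.

```latex
\[
  y = X\beta + Qv + Zu + e, \qquad
  u \sim N\!\left(0,\; 2K\sigma_g^{2}\right), \qquad
  e \sim N\!\left(0,\; I\sigma_e^{2}\right)
\]
```

Dropping the Qv term gives a kinship-only model, dropping Zu gives a structure-only (or eigenanalysis) model, and dropping both gives the naive model.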

With the rapid increase in the amount of SNP marker data there is a need for methods that are able to cope with thousands to millions of computationally intensive analyses. To deal with this, emerging methodologies provide us with a choice of both approximate [e.g. GRAMMAR (Aulchenko et al. 2007), implemented in GenABEL (http://www.genabel.org/packages/GenABEL); P3D (Zhang et al. 2010), implemented in TASSEL (http://www.maizegenetics.net/tassel); EMMAX (Kang et al. 2010) (http://genetics.cs.ucla.edu/emmax/)] and exact methods [e.g. FMM (W. Astle & D. Balding, http://www.genabel.org/MixABEL/FastMixedModel.html), FaST-LMM (Lippert et al. 2011) (http://mscompbio.codeplex.com/), GEMMA (M. Stephens lab, http://stephenslab.uchicago.edu/software.html)] to account for structure effects.

Naive Approach

In its simplest form, the naive approach—which does not account for any population structure correction—is based on the same principles that work for bi-parental QTL mapping populations and consists of a regression of the phenotype upon the genotype to detect the QTLs. Each marker in a genetic map has a probability to be associated with the QTL of interest. The naive approach is suitable for use in the following two types of population—though some would argue that as all populations have some residual structure, a structure correction should always be applied.
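
The regression of phenotype upon genotype described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis used in any of the cited studies: markers are biallelic and scored 0/1 in inbred lines, there are no missing data, and no correction for population structure is applied. The function names and the dictionary layout of the marker data are hypothetical.

```python
import math


def marker_regression(genotypes, phenotypes):
    """Least-squares regression of a phenotype on one 0/1 marker score.

    Returns (slope, r_squared, t_statistic); t has n - 2 degrees of freedom.
    """
    n = len(genotypes)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need matching vectors with at least 3 lines")
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, phenotypes))
    if sxx == 0 or syy == 0:
        raise ValueError("monomorphic marker or constant phenotype")
    slope = sxy / sxx                    # estimated allele substitution effect
    r2 = sxy * sxy / (sxx * syy)         # proportion of variance explained
    if r2 >= 1.0:
        t = math.copysign(math.inf, slope)
    else:
        t = math.copysign(math.sqrt(r2 * (n - 2) / (1.0 - r2)), slope)
    return slope, r2, t


def naive_scan(marker_scores, phenotypes):
    """Test every marker in turn; return (name, result) pairs, strongest first.

    marker_scores maps a marker name to its list of 0/1 scores.
    """
    results = {name: marker_regression(scores, phenotypes)
               for name, scores in marker_scores.items()}
    return sorted(results.items(), key=lambda kv: abs(kv[1][2]), reverse=True)
```

In a stratified panel, any marker whose allele frequencies differ between sub-populations would rank highly in such a scan whenever the trait means also differ between sub-populations, which is why this approach is only appropriate where structure is weak.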

Constructed Populations

New population types that capture the advantages of both linkage mapping and GWAS, and that focus on achieving high statistical power, high resolution and low population stratification, have been developed in several species and have been, or are being, developed in barley. Nested Association Mapping (NAM) (McMullen et al. 2009) and heterogenic stock inbred lines, also known as multi-parent advanced generation intercross or MAGIC populations, overcome the handicaps imposed by stratification in natural germplasm collections (Cavanagh et al. 2008). Trait mapping using NAM and MAGIC populations is more complete, owing to greater genetic diversity, and more precise than in classical bi-parental populations. The short history of recombination gives high statistical power to QTL detection, while ancestral recombination and diversity accumulated between the parental lines provide the basis for much finer scale mapping. Rounds of inter-crossing and selfing remove long range LD present between the parental lines, and each extra generation will shuffle the genetic contribution from the founder lines more and more. For NAM in maize, twenty-five diverse lines were crossed to B73 and the F1 plants self-fertilized for six generations to create a series of twenty-five recombinant inbred line (RIL) families ultimately totalling 5000 individuals. In MAGIC populations a complex and time-consuming crossing scheme has to be implemented to avoid the creation of clusters of highly related progenies that could potentially introduce de novo germplasm stratification.

Sub-Populations

Artificial out-crossing imposed by breeders coupled with the long recombination history of crop germplasm can create a highly diverse germplasm stock without major population sub-divisions. Assembling a population of this type is the approach we have taken. By exploiting the European elite two-rowed spring barley genepool, our association mapping population effectively behaves like a heterogenic stock inbred line population without strong stratification. It lacks confounding population effects and its assembly avoided complex and time-consuming crossing schemes. Most important from our point of view was that it enabled us to perform QTL analysis and discovery in a germplasm set that was directly related to the contemporary barley breeding genepool. We explored population structure in a large set of germplasm, using phylogeny, principal coordinates and STRUCTURE analyses to examine stratification and admixture, then chose to remove outlying lines from the final panel that we now use routinely for association mapping studies.

5 Genetic Markers

Given the increased resolution available in association mapping panels, it is important that the number of molecular markers used for analysis is sufficient to exploit the number of recombination events. An early attempt at an association analysis in barley was by Kraakman and colleagues (2004). Using sparse genome coverage they reported a number of significant associations for yield and stability of yield with a number of AFLP loci. They claimed some correspondence of the position of these loci with known QTL from biparental mapping studies but this assertion was complicated by a lack of common markers. In a subsequent study using the same material they reported marker loci significantly associated with Barley Yellow Dwarf Virus resistance and quantitative measures of leaf rust resistance (Kraakman et al. 2006). Again some correspondence of positions with previous studies was claimed but in one instance the particular AFLP locus had been previously reported to be the peak marker for Rphq2, a major QTL for partial resistance to P. hordei. The most important limitation in these early studies was that the marker technology employed, AFLP, is not well suited to this application.

A breakthrough came with the development of highly parallel SNP assay systems such as the Illumina GoldenGate™ assay implemented with their oligo pool array technology (Fan et al. 2003). Rostoks et al. (2006) and Close et al. (2009) used alignments between barley EST sequences to identify SNPs and used these to generate two 1536 SNP barley oligo pool assays (BOPA1 and BOPA2). Using BOPA1 on a relatively small population of barley cultivars, Rostoks et al. (2006) successfully identified by GWAS associations between growth habit and a cluster of CBF genes responsible for winter hardiness in barley, after classifying the genotypes according to their spring or winter growth habit. Since then, denser arrays of markers have been produced for application in GWAS. For example, we recently exploited Illumina GAIIx RNA-seq datasets from a range of barley cultivars to identify > 30,000 robust SNPs and incorporated approximately 8,000 of these on a higher density SNP platform called a 9K iSelect Infinium array (our unpublished results). It is likely that similar but higher density chips with > 30,000 SNPs will be developed in the near future.

However, there is some debate over whether this platform is the best in the longer term. As the cost of generating high coverage genome sequence continues to drop, we and others have turned to another approach termed Genotyping-by-Sequencing (GbS) (Elshire et al. 2011). GbS promises even greater depth of coverage of polymorphic sequence information while avoiding the serious issue of ascertainment bias inherent in SNP chip platforms (see below). The disadvantage at present is that the informatics pipelines required to analyse GbS datasets require custom scripts, generally written by specialists in the labs pioneering the approach. In contrast, Infinium array development is accompanied by an ‘out-of-the-box’ software suite from the vendor that enables simple allele calling and QC along with easy export into various analytical packages. Of course, this situation will rapidly change as more individuals adopt the GbS approach.

6 Ascertainment Bias

The development of multiplex assays such as the Infinium chip discussed above generally involves mining data extracted from a limited number of individuals. The utility of the SNP sets thus obtained is affected by the parameters of this discovery protocol. SNPs are generally identified in a discovery panel, which consists of a small sample of individuals from a population. As this panel represents only a subset of the individuals, only a fraction of total polymorphisms will be discovered. Consequently, when these SNPs are then genotyped on a larger sample of individuals an ‘ascertainment bias’ is introduced (Nielsen 2000). Because the discovery panel is small, the probability that a SNP will be identified in this panel is a function of the allele frequency. Thus, rare SNPs will go undiscovered more often than common SNPs. When a SNP platform developed this way is then used to screen a much broader set of germplasm, the introduced bias may compromise measures of relatedness and genetic diversity. This is largely because statistical measures that rely on allele frequency, such as nucleotide diversity, population genetics parameters and linkage disequilibrium, will be affected, and such effects have been observed (Nielsen 2000; Schlotterer and Harr 2002; Rosenblum and Novembre 2007; Storz and Kelly 2008). In barley, BOPA1, BOPA2 and the recent 9K iSelect platform have also been selected from a limited number of barley accessions (Rostoks et al. 2005, 2006; Close et al. 2009; Waugh et al. unpublished data). These SNPs have provided extensive genome coverage and have dramatically progressed our understanding of the distribution of genetic diversity within the barley genepool. Indeed several large scale projects have already used these platforms to identify marker-trait associations in elite cultivars (AGOUEB, http://www.agoueb.org; BarleyCAP, http://barleycap.cfans.umn.edu; ExBarDiv: http://pgrc.ipk-gatersleben.de/barleynet/projects_exbardiv.php) (Waugh et al. 2010). We should be mindful that the extent and patterns of diversity observed will be limited by such ascertainment issues present in the underlying data.
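
The dependence of SNP discovery on allele frequency can be illustrated with a simple calculation. The sketch below assumes that the discovery panel consists of n fully homozygous lines drawn at random from the population of interest, so that a SNP with minor allele frequency p is detected only if both alleles appear among the n lines, which happens with probability 1 − pⁿ − (1 − p)ⁿ. The function name and the panel size of eight used in the usage note are hypothetical.

```python
def discovery_probability(maf, panel_size):
    """Probability that a biallelic SNP is polymorphic in a discovery panel.

    maf        -- minor allele frequency in the wider population (0 to 0.5)
    panel_size -- number of inbred lines in the discovery panel
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must lie between 0 and 0.5")
    if panel_size < 1:
        raise ValueError("panel_size must be at least 1")
    # The SNP is missed if every line carries the same allele
    return 1.0 - maf ** panel_size - (1.0 - maf) ** panel_size
```

Under these assumptions, with a panel of eight lines a SNP at a frequency of 0.05 would be discovered only about a third of the time, whereas one at a frequency of 0.4 would be discovered about 98% of the time, so the resulting marker set is enriched for common variants.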

Particularly problematic is the use of SNPs ascertained from the cultivated genepool to examine diversity outside of that genetically narrow set. In barley we are fortunate to have extensive collections of wild progenitors collected from the Mediterranean basin through south western Asia and eastwards as far as Tajikistan and the Himalayas, as well as locally cultivated landraces grown throughout the marginal regions of the Fertile Crescent. Understanding the genetic diversity within these, particularly the landrace collections that grow and yield under extreme conditions of temperature and water availability, will be important in future breeding programmes that seek to respond to a range of environmental challenges.

Moragues et al. (2010) evaluated the effects of SNP number and selection strategy on estimates of germplasm diversity and population structure for different types of barley collections. Using the 1536 BOPA1 SNP data and various subsets of 384 and 96 SNPs that could in principle be used for affordable middle-throughput genotyping platforms, they compared diversity statistics for 161 landraces from Jordan and Syria with 171 European cultivars. Differences were observed in patterns of SNP polymorphisms as well as a lower estimate of diversity in the landraces, contradicting previous studies using SSRs (Russell et al. 2003). This bias could be at least partially nullified by selecting an appropriate subset of SNPs. All marker subsets gave qualitatively similar estimates of the population structure in both landraces and cultivars. Russell et al. (2011) described the first application of the BOPA1 SNP platform to assess the evolution of barley in a portion of the Fertile Crescent, by genotyping geographically matched landrace and wild barleys (448 accessions) from Jordan and Syria. The question of ascertainment bias skewing the landrace-wild comparison, through greater ‘pruning’ of rarely polymorphic markers in wild germplasm and generating an underestimate of genetic diversity, was addressed. While they were unable to exclude this possibility, their data did show higher levels of genetic variation in wild material suggesting that the relative pruning of SNPs in wild compared to landrace barley is most likely limited. Furthermore, the difference in diversity levels between landrace and wild barleys was similar to that found in previous work (Russell et al. 2004).

In this particular study they wanted to examine diversity across the genome and particularly in regions that have been identified as playing a role in domestication. If the effect of bias, introduced by choosing SNPs polymorphic in elite cultivars, was likely to be problematic, the result would be a reduction of diversity in wild compared to landraces around the domestication genes, countering the objective of the study. They identified 141 cases where rolling diversity estimates were significantly different between wild and landraces, with diversity higher in wild material in the vast majority (132 cases). Many were in regions of the genome where domestication genes are found. With the possibility of ascertainment bias pushing the comparison in the other direction, this result therefore becomes doubly significant.

7 GWAS

The feasibility of mapping Mendelian traits that are determined by single major genes by GWAS using panels of barley cultivars was clearly demonstrated by mapping SNP polymorphisms in germplasm collections by LD to positions that corresponded exactly to locations previously assigned by biparental genetic mapping (Rostoks et al. 2006; Waugh et al. 2010). This approach has subsequently been extended to the analysis of simple and more complex phenotypic traits.

GWAS for Simple Phenotypes

In the first reported study, Kraakman et al. (2006) used a Pearson correlation coefficient between vectors of the phenotypic response and genetic markers, correcting for multiple testing and population structure, to identify a significant association between the DUS character ‘rachilla hair length’ and the microsatellite BMAG223. Subsequently, we used GWAS to investigate the morphological differences that are used for the characterisation of cultivars in tests of Distinctness, Uniformity and Stability (DUS). DUS characters form a ready source of highly heritable traits that are presumed to be under the control of a limited number of major genes. Cockram et al. (2010) used 490 cultivars (both winter and spring) that had been genotyped with BOPA1 revealing 1,111 sufficiently informative markers. GWAS using a mixed model to correct for population substructure identified fifteen traits that had clearly significant associations with specific genomic regions. The majority of these traits appeared to identify a single genetic locus. They included ‘seasonal growth habit’ (1H), ‘grain lateral nerve spiculation’ (2H), ‘grain aleurone colour’ (4H), ‘hairiness of leaf sheath’ (4H), ‘rachilla hair type’ (5H), ‘ear attitude’ (5H) and ‘grain ventral furrow hair’ (6H). Several of these genetic positions coincided with the previously known locations for these morphological characters; others, such as the 1H position shown for seasonal growth habit, were unexpected. Of particular interest was a region on chromosome 2H that was found strongly associated with a number of anthocyanin based DUS characters. They noted that the Mendelian locus ANTHOCYANINLESS 2 (ANT2) had been previously reported on chromosome 2HL based on studies involving biparental crosses. Similar mapping work with a biparental population also genotyped with BOPA1 indicated that the map location of ANT2 coincided with the position identified in the association panel. Then they derived a composite phenotype with two character states: absence of anthocyanin coloration in all recorded tissues (awns, auricles and lemma nerves), or presence in one or more of these structures. GWAS of the composite phenotype found the genetic interval controlling this trait to lie between 93.5 and 103.7 cM on chromosome 2H, with the peak association (-log10 p = 51.7, marker 11_21175) at 96.8 cM.
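
To put the size of this peak in context, the sketch below computes a Bonferroni genome-wide threshold on the −log10 scale. It assumes one test per informative marker (1,111 in this panel) and an overall significance level of 0.05; the source does not state that exactly this threshold was applied in the study, so the numbers are illustrative, and the function name is hypothetical.

```python
import math


def bonferroni_threshold(n_tests, alpha=0.05):
    """Return (per-test p-value cut-off, the same cut-off as -log10 p)."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    p_cut = alpha / n_tests
    return p_cut, -math.log10(p_cut)


# Illustrative use with the 1,111 informative BOPA1 markers
p_cut, score_cut = bonferroni_threshold(1111)
```

Under these assumptions the cut-off is a −log10 p of about 4.35, so a peak of 51.7 exceeds it by more than forty orders of magnitude, as would be expected for a Mendelian trait.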

Additional genetic markers were developed using co-linearity with rice chromosome 4 and Brachypodium (B. distachyon) chromosome 5, ultimately defining the ANT2 locus to within a 0.57 cM interval flanked by the barley homologues of LOC_Os04g47110 and LOC_Os04g47020. These flanking markers were used to identify a minimum tiling path of BACs across the interval that were then sequenced. The 260 kb interval contained eleven genes, of which eight were located at collinear positions in one or more related cereal genomes. Three gene models were identified between the flanking markers, including a strong candidate gene that showed high homology to genes at the R/B loci that encode proteins containing a bHLH DNA-binding domain and that have previously been found to control anthocyanin pigmentation in maize.

Sequencing a 4.6 kb interval across the candidate gene HvbHLH1 in a subset of 90 cultivars identified 69 polymorphisms arranged in 4 haplotypes, with haplotype 1 exclusive to ‘white’ varieties, while haplotypes 2-4 were associated with anthocyanin coloration in one or more tissues. The identified polymorphisms between the haplotype groups included eight synonymous and four non-synonymous variants, as well as a 16 bp deletion within exon 6 that results in truncation of the predicted protein upstream of the bHLH domain. Subsequent genotyping in the complete association panel established that the 16 bp deletion occurred in all cultivars lacking anthocyanin pigmentation, and not in cultivars in which anthocyanin is expressed in one or more tissues. Thus, GWAS for this Mendelian trait identified a region of the genome that with additional marker development could be reduced to only three genes, including a strong candidate gene that showed functional variation and was diagnostic for the trait (see Cockram et al. 2010 for further details).

GWAS for Simple Traits Identifies Epistatic Interactions

Cockram et al. (2008) identified two epistatic loci controlling vernalisation requirement by GWAS. The panel consisted of 429 spring and winter barley varieties and was genotyped with S-SAPs and SSRs together with markers based on gene-specific amplicons. The genetics of vernalisation requirement in barley is relatively well characterised, being controlled predominantly by two major loci: VRN-H1 and VRN-H2 (von Zitzewitz et al. 2005). Spring alleles are thought to be due to deletions spanning putative cis-elements in VRN-H1 intron I, or to deletions of part or all of the genomic region carrying the VRN-H2 candidate genes. There is thus an epistatic relationship between the loci, with winter barleys requiring winter alleles at both VRN-H1 and VRN-H2, which potentially makes their detection problematic in GWAS. However, markers for both loci were found to be associated with winter habit in this panel when genomic control was used and population structure was allowed for in the analysis (Cockram et al. 2008). This finding confirmed the results of previous detailed bi-parental mapping studies that had furnished the GWAS investigation with the markers targeting the functional polymorphisms at VRN-H1 and VRN-H2.
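For readers unfamiliar with genomic control, the sketch below shows the general idea: the median of the genome-wide test statistics is compared with its expected value under the null hypothesis, and all statistics are deflated by the resulting inflation factor. It assumes 1-degree-of-freedom chi-square statistics and uses the common convention of not letting the factor fall below 1; it is not necessarily the exact procedure used by Cockram et al. (2008), and the function name and example values are hypothetical.

```python
from statistics import median

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549


def genomic_control(chi2_stats):
    """Return (inflation factor, adjusted statistics) for 1-df chi-square tests."""
    if not chi2_stats:
        raise ValueError("at least one test statistic is required")
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # assumed convention: never inflate the statistics
    return lam, [s / lam for s in chi2_stats]


# Hypothetical statistics whose median is twice the null expectation.
example_stats = [0.2, CHI2_1DF_MEDIAN, 2 * CHI2_1DF_MEDIAN, 4 * CHI2_1DF_MEDIAN, 20.0]
inflation, adjusted = genomic_control(example_stats)
```

In this toy example the inflation factor is 2, so every statistic is halved; a strong association (here 20.0) remains large after correction, while background inflation caused by population structure is removed.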

The lack of genomic marker coverage hampered the study of Cockram et al. (2008). Ramsay et al. (2011) used the BOPA1 and BOPA2 platforms to elucidate the control of another epistatic interaction that aligns with population sub-structure in barley: that underlying ear-row number. Barley possesses three single-flowered spikelets at each rachis node, with the alternating triplets appearing opposite each other in two ranks, thus forming six files of spikelets. When all three are fertile the ear has six rows of grains, but if the two outer lateral spikelets are sterile then the ear is two-rowed. The presence of six rows is controlled principally by the cloned gene VRS1 on chromosome 2H (Komatsuda et al. 2007), which has been known for some time to be modified by the action of INT-C on chromosome 4H. In germplasm surveys, the vrs1.a allele in six-rowed barley cultivars is generally complemented by the Int-c.a allele and in two-rowed cultivars Vrs1.b is always complemented by int-c.b. The presence of int-c.b in six-rowed cultivars (i.e. vrs1.a, int-c.b) results in the development of smaller lateral spikelets (Lundqvist et al. 1997). In normal two-rowed (i.e. Vrs1.b) barley, int-c.b suppresses anther development in the lateral spikelets. In contrast, Int-c.a in two-rowed cultivars (i.e. Vrs1.b, Int-c.a) causes enlarged, partially male fertile, lateral spikelets.
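The four allele combinations described above can be summarised as a simple lookup. This is only a restatement of the paragraph in code form; the dictionary layout and function name are hypothetical, and allele names are case-sensitive (vrs1.a versus Vrs1.b, Int-c.a versus int-c.b).

```python
# (VRS1 allele, INT-C allele) -> lateral spikelet phenotype, as described in the text.
ROW_TYPE = {
    ("vrs1.a", "Int-c.a"): "six-rowed (combination generally found in six-rowed cultivars)",
    ("vrs1.a", "int-c.b"): "six-rowed with smaller lateral spikelets",
    ("Vrs1.b", "int-c.b"): "two-rowed with anther development suppressed in lateral spikelets",
    ("Vrs1.b", "Int-c.a"): "two-rowed with enlarged, partially male fertile lateral spikelets",
}


def row_phenotype(vrs1: str, int_c: str) -> str:
    """Look up the expected ear phenotype for a two-locus genotype."""
    try:
        return ROW_TYPE[(vrs1, int_c)]
    except KeyError:
        raise ValueError(f"unrecognised allele combination: {vrs1}, {int_c}") from None


# The effect of an INT-C allele depends on the VRS1 background (epistasis).
int_c_a_effects = {vrs1: row_phenotype(vrs1, "Int-c.a") for vrs1 in ("vrs1.a", "Vrs1.b")}
```

The lookup makes the epistasis explicit: the same INT-C allele gives a different lateral spikelet phenotype depending on which VRS1 allele it is combined with.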

Row type is indicative of a major population division in barley germplasm, though some cross breeding has occurred, in particular in the development of European winter-sown barleys. Despite this population stratification, association tests of row type in 190 barley cultivars with 2473 bi-allelic genome-wide SNPs revealed associations on chromosomes 1HL, 2HL and 4HS. The association of a SNP in a gene estimated to be 0.05 cM (seven genes) distal to VRS1 indicated that the peak on 2HL was caused by VRS1. This was confirmed by re-sequencing VRS1 across the mapping panel, finding complete association with causal vrs1.a alleles. Direct evidence for the correspondence between the association on 4HS and INT-C was again complicated by a lack of common markers with previous mapping studies and the inherent difficulty in phenotyping the environmentally sensitive intermedium trait in bi-parental populations (Lundqvist et al. 1997). Using rice gene content and order as a proxy, further characterization of the region was once again achieved by re-sequencing PCR amplicons derived from barley orthologues of the neighboring rice genes across the association panel. This showed that a significant level of association was maintained over a region of some twenty genes that included several strong candidate genes for INT-C, notably the barley orthologue of maize TEOSINTE BRANCHED 1 (ZmTB1). ZmTB1 is a domestication gene and member of the TCP gene family that encodes putative basic helix-loop-helix DNA-binding proteins and whose members are involved in the control of organ growth. Resequencing confirmed that HvTB1 contained the most significantly associated SNP, and genetic mapping placed it in the expected location. Definitive evidence that HvTB1 was INT-C was obtained by re-sequencing HvTB1 in a collection of 17 known INT-C mutants in a Vrs1.b (two-row) background.
The GWAS approach thus enabled dissection of the epistatic control of row-type and high resolution mapping, and ultimately cloning of the interacting genes.

GWAS for Quantitative Traits

The use of GWAS to dissect the genetic control of quantitative traits is more complex than its use for simpler traits controlled by a limited number of major genes. There are evident limitations to the power of a GWAS to determine the loci underlying a quantitative trait, depending on the size and nature of the panel used as well as the complexity of the genetic control of the trait. Simulations can give some guidance to the expected limitations of the power of a particular study (Cockram et al. 2010) as well as to the appropriateness of methodologies to allow for population structure. However, the use of a much higher density of markers, and the direct relationships between association and bi-parental studies revealed by sharing the same genotyping platform, have made such comparisons easier in recent studies. The functional validation of candidate genes underlying quantitative variation is more complicated than for monogenic or oligogenic traits, where the developmental or morphological consequences of functional genetic variation may already have been characterised through the use of mutant plant resources. Usually knowledge of the genetic architecture of the trait in the germplasm under study is scarce and there is no reference in bi-parental populations; even when positional correspondence between bi- and multi-parent populations is observed, it is generally difficult to prove that they share the same underlying genetic determinants. The nature of the trait may hinder exploration, and using rice or Brachypodium gene content and order as a proxy is difficult because the type of gene responsible for the trait may be unknown. The most robust associations for entering the validation pipeline can be prioritised by identification of the same associations in independent germplasm.
Figure 10.2 shows how a significant height QTL on chromosome 3H detected in a spring barley association panel consisting of 650 lines with de novo height data is cross-validated in an independent dataset consisting of 230 spring lines using 15 years of historical data. The association on chromosome 3H is almost certainly due to the green revolution gene sdw1 (Jia et al. 2009) and is co-located with the sdw1 phenotype mapped in a mapping population (Thomas et al. unpublished data; Malosetti et al. 2011), but other associations observed have not yet been characterised. Given the difficulties associated with validating associations with components of complex traits, it is not surprising that there is little in the literature yet describing successes in this domain. However, the authors are aware of several studies where components of complex traits have been resolved to gene level and validated using mutant resources (Jordi Comadran and colleagues, unpublished results).
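Prioritising associations that replicate in independent germplasm can be expressed as a simple intersection of significant markers from two scans run on the same SNP platform. The sketch below is illustrative only: the marker names, the -log10(p) values and the threshold of 4.0 are hypothetical, and requiring the identical marker in both panels is a simplifying assumption (in practice, markers in close linkage with one another might also be treated as the same association).

```python
def replicated_hits(panel_a, panel_b, threshold):
    """Return markers whose -log10(p) meets the threshold in both panels."""
    shared = panel_a.keys() & panel_b.keys()  # markers scored in both scans
    return sorted(
        m for m in shared
        if panel_a[m] >= threshold and panel_b[m] >= threshold
    )


# Hypothetical -log10(p) values from two independent panels.
de_novo_scan = {"snp_1": 7.2, "snp_2": 1.1, "snp_3": 5.0, "snp_4": 6.3}
historical_scan = {"snp_1": 5.8, "snp_2": 4.9, "snp_3": 0.7, "snp_5": 9.0}

robust = replicated_hits(de_novo_scan, historical_scan, threshold=4.0)
```

Only associations supported by both data sets survive this filter, which mirrors the logic of the cross-validation shown for chromosome 3H.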

Fig. 10.2

Cross-validation of genome wide association (GWA) scans across independent germplasm sets genotyped with the same SNP platform. BOPA1 SNP loci with minor allele frequencies > 10 % and missing data < 10 % were used for a GWAS using a kinship mixed model approach as implemented in Genstat v.14 (VSN International). TASSEL v3.0 was used to estimate the kinship matrix (K) from a subset of random markers covering the whole genome so that we did not over-estimate sub-population divergence. (a) Highly replicated height data collected from 200 elite 2 row spring cultivars over a period of ~20 years were analysed by GWAS. Several significant association peaks were detected but only chromosome 3H is shown. -log10 (p values) are plotted in chromosomal order and may not reflect genetic distances. (b) Chromosome 3H scan for “de novo” height data collected on 650 2 row spring cultivars in one season. The top SNP (highlighted in the graphs with a circle) is tightly linked to the barley green revolution gene sdw1 (Ramsay et al. unpublished data).
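The marker quality filter mentioned in the caption (allele frequency > 10 % and missing data < 10 %) might be implemented along the following lines. This is a sketch under stated assumptions, not the Genstat or TASSEL procedure: each inbred line is assumed to carry one homozygous call per SNP, coded here as "A" or "B" with None for missing data, and the function name and example calls are hypothetical.

```python
from collections import Counter


def passes_filter(calls, min_maf=0.10, max_missing=0.10):
    """Keep a bi-allelic SNP if MAF > min_maf and missing rate < max_missing."""
    if not calls:
        return False
    observed = [c for c in calls if c is not None]
    missing_rate = 1 - len(observed) / len(calls)
    if missing_rate >= max_missing:
        return False
    counts = Counter(observed)
    if len(counts) != 2:  # monomorphic, or not bi-allelic
        return False
    maf = min(counts.values()) / len(observed)  # frequency of the rarer allele
    return maf > min_maf


# Hypothetical calls for three SNPs across 20 lines.
snp_common = ["A"] * 15 + ["B"] * 4 + [None]      # rarer allele 4/19, 5 % missing
snp_rare = ["A"] * 19 + ["B"]                     # rarer allele 1/20
snp_sparse = ["A"] * 10 + ["B"] * 7 + [None] * 3  # 15 % missing

kept = [name for name, calls in
        [("snp_common", snp_common), ("snp_rare", snp_rare), ("snp_sparse", snp_sparse)]
        if passes_filter(calls)]
```

Rare alleles and poorly scored markers are excluded before the association scan because both tend to produce unreliable test statistics.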

8 Future Prospects

Over the past several years we, and others, have successfully assembled the molecular tools, tested various analytical approaches and ‘tuned’ our choice of biological resources to effectively take advantage of genome wide association scans. Ultimately we chose to focus on exploiting variation in the relatively narrow 2-row spring barley genepool to take advantage of the limited population substructure, to reduce the number of segregating alleles at each locus, to facilitate generation of an efficient unbiased genotyping platform and to focus on contemporary germplasm that is still exploited for breeding in the public and private sectors. This latter choice in particular has allowed us to interact effectively with those involved in crop improvement and allowed easy transfer of resources and technologies into a domain that has real impact on determining the varieties that are grown in farmers’ fields. These choices together have allowed the isolation of major genes and genes controlling more complex traits. In future a significant issue remains over how we most effectively validate associations with components of highly complex traits such as yield and quality, and in such cases how the data is best exploited by the end user community. Thus, while as academics we are focused on using the information for gene identification and validation, we are also actively exploring how the phenotypic and molecular marker data can be integrated into a practical crop improvement program. Currently we are focusing on ‘Genomic Selection’ (GS—Meuwissen et al. 2001). A general view is that GS holds much promise for crop improvement but precisely how it will be implemented remains to be established. We conclude that, if establishing GWAS in barley effectively delivers the dual outcomes of facilitating gene isolation and providing the molecular and phenotypic datasets to establish Genomic Selection, then what we have learned will have been valuable and worthwhile.