Introduction

Common bean (Phaseolus vulgaris) is the third most important grain legume in the world and is the most important food crop among these, produced over an area of 18 million hectares with large regions of production in Latin America and Eastern and Southern Africa, where the crop is critical to nutritional security and income generation (Broughton et al. 2003). Cultivated common beans originated in two centers of diversity giving rise to two genepools: the Mesoamerican genepool from Central America and Mexico and the Andean genepool from the Andes mountains of South America (Singh et al. 1991a). The differences between Mesoamerican and Andean gene pools include seed size, plant morphology (Singh et al. 1991b), phaseolin (seed storage protein) patterns (Gepts et al. 1986) and alleles of various types of molecular markers including microsatellites (Blair et al. 2006a, 2007; Díaz and Blair 2006). Cultivated bean gene pools have further been divided into races according to morphological criteria and agro-ecological adaptation (Singh et al. 1991a, b) where the term ‘race’ is used to denote a group of related genotypes and where members of each race have distinctive and specific physiological, agronomic, biochemical and molecular characteristics and differ from other races in the allelic frequencies at specific isozyme or microsatellite loci (Singh et al. 1991c; Blair et al. 2007; Díaz and Blair 2006). The original six races identified by Singh et al. (1991a) were Durango, Jalisco and Mesoamerica in the Mesoamerican genepool; and Chile, Nueva Granada and Peru in the Andean genepool. The race concept for Mesoamerican common beans was further refined by Beebe et al. (2001) who provided molecular evidence for a seventh race, termed Guatemala and by Díaz and Blair (2006) and Blair et al. (2006a) who found the Durango and Jalisco races to be grouped together rather than as separate races.

Microsatellite markers based on simple sequence repeat loci have been particularly useful in studying population structure given their higher polymorphism level compared to other markers and better resolution in diversity studies (Akkaya et al. 1992; Pejic et al. 1998; Cho et al. 2000; Díaz and Blair 2006; Blair et al. 2007). Large-scale studies, with more than 200 genotypes at a time, initially were conducted principally in US or European institutions for cereal crops such as barley, Hordeum vulgare L. (Malysheva-Otto et al. 2006), maize, Zea mays L. (Liu et al. 2003), pearl millet, Pennisetum glaucum (L.) R. Br. (Oumar et al. 2008) and rice, Oryza sativa L. (Garris et al. 2005). More recently, the germplasm collections held in the CG centers for international agricultural research have been analyzed, mainly as part of the Generation Challenge Program on “Unlocking genetic diversity” (http://www.generationcp.org), and these large-scale studies are shedding light on population structure of the most important germplasm for the developing world (Thomson et al. 2007; Warburton et al. 2008). Microsatellite methods based on fluorescent labeling and automated band calling with precise, software-based allele detection are considered the most accurate way of genotyping with this type of marker (Diwan and Cregan 1997; Blair et al. 2002; McCouch et al. 1997). In addition, fluorescent labeling of microsatellites and their organization into panels of different dye colors allows multiplexing during band separation, which benefits high-throughput genotyping and the simultaneous analysis of multiple loci (Reed et al. 1994; Schwengel et al. 1994; Coburn et al. 2002; Tommasini et al. 2003). Multiplexing can also be carried out in the PCR amplification step with mixtures of the appropriate primers (Hayden et al. 2008).
Current technology allows for capillary-based separation of microsatellite bands with four color panels of four individual markers labeled with different dyes evaluated through a single capillary with a separate size standard, occupying a fifth detection wavelength range, allowing for precise band size estimates (Coburn et al. 2002).

Association of phenotypic traits with microsatellite alleles has been found to be practical in several crop species, although SNP markers, sequence haplotypes or high-density marker sets such as DArT or AFLPs are usually preferred for association mapping (Flint-Garcia et al. 2005; Rostoks et al. 2006; Stick et al. 2006; Crossa et al. 2007). The number of markers used for association genetics varies, although there is general agreement that more are required in out-crossing species or widely diverse germplasm samples, where linkage disequilibrium is low, than in less diverse germplasm sets of inbreeding species, where linkage disequilibrium is high (Buckler and Thornsberry 2002). Estimates of linkage disequilibrium are usually precursors to actual association analysis, which requires phenotypic trait evaluation in addition to a large genotypic data set. However, the usual process of constituting a germplasm panel for genotyping that is later phenotyped is inefficient. An alternative strategy is to genotype previously phenotyped collections such as those held in germplasm banks or assembled into core collections (Brown 1989). In any case, the phenotypic traits evaluated should be of high heritability to find associations that are not influenced by environmental variability (Flint-Garcia et al. 2005).

In common bean, the evaluation of microsatellite marker diversity has been useful for analysis of race structure as described in Blair et al. (2006a, 2007) and Díaz and Blair (2006); however, in these cases gel-based assays were used and this limited the number of genotypes that could be evaluated. Therefore, there has been a need for the analysis of larger and more comprehensive sets of germplasm such as core collections. In the case of common beans, two core collections exist: one of 1,441 individuals at CIAT that has been extensively phenotyped (Beebe et al. 1997; Islam et al. 2002, 2004) and another prepared by the USDA, which has been partially phenotyped (Miklas et al. 1999). Correspondingly, there has been a need for the development of a high-throughput method for microsatellite diversity assessments using fluorescently labeled markers in common bean. To date, only a single study has been conducted with this type of marker in this crop species, evaluating only a limited number of genotypes with a small set of cDNA-based microsatellites (Masi et al. 2003). Meanwhile, no attempts have been made to perform association studies with microsatellites, although one study did attempt to associate specific RAPD markers with various traits in part of the CIAT core collection (Islam et al. 2004).

In this study, our overall goal was to determine population structure in a large-scale international collection of common bean accessions from the CIAT core collection using fluorescent microsatellites and to conduct association analysis with seed phenotypic characteristics. Our specific objectives were (1) to assess genetic diversity in a group of 604 genotypes from the CIAT germplasm and core collections with microsatellite markers and phaseolin evaluation; (2) to test a set of fluorescent microsatellite markers for capillary-based genotyping based on previous polymorphism evaluation of a large set of cDNA (gene-based or coding) microsatellites and genomic (non-coding) microsatellites; (3) to determine if race structure was well defined in this core collection when compared with our previous studies and if diversity was similar in primary and secondary centers of diversity; and (4) to conduct a test for association of seed size, length, width and height with the markers evaluated, comparing associations with phaseolin and with individual microsatellite loci in each subpopulation.

Materials and methods

Plant material

A total of 604 genotypes were evaluated in this study, including totals of 304 Mesoamerican genepool accessions and 300 Andean genepool accessions, all selected from the CIAT germplasm collection. Of these, all the Andean genotypes were from the core collection while the Mesoamerican genotypes included 23 genotypes from the general collection that had been previously evaluated by Díaz and Blair (2006). The majority of the accessions were from the primary center of diversity especially the countries of Mexico (179 genotypes) and Peru (180 genotypes), while the remainder were from Argentina (10), Brazil (26), Bolivia (11), Chile (8), Colombia (32), Costa Rica (16), Cuba (2), Dominican Republic (2), Ecuador (36), El Salvador (6), Germany (2), Guatemala (60), Haiti (9), Honduras (9), Nicaragua (6) and the United States (3). One accession each was from Australia, Burundi, France, Jamaica, Malawi, Rwanda and Uganda. The accessions had the following phaseolin alleles: S, Sb, Sd, B and M for the Mesoamerican genotypes; T, C, H and A for the Andean genotypes. Control genotypes included the Andean genotypes, Calima/G4494 and Chaucha Chuga/G19833, as well as the Mesoamerican genotypes, ICA Pijao/G5773 and Dorado/DOR364 (with common name and germplasm entry or advanced line name listed in each case). Seed samples for the core collection are maintained by the bean project at CIAT and were originally from the Genetic Resource Unit (http://isa.ciat.cgiar.org/urg/main.do). Selection from the CIAT core collection was based on the random sampling of genotypes evaluated by Beebe et al. (2000, 2001) while the core collection itself was carefully chosen to represent the broadest possible diversity including modern varieties, landraces and a few breeding lines which represent all agro-ecological zones where the crop is grown (Tohme et al. 1995; Beebe et al. 1997). Of the 1,441 genotypes in the core collection, 41% were evaluated in our study.
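As a quick consistency check on the sampling described above, the per-country counts can be tallied against the genepool totals. The sketch below only transcribes the numbers given in the text; the variable names are illustrative and are not part of the original analysis.

```python
# Accession counts per country of origin, transcribed from the paragraph above.
country_counts = {
    "Mexico": 179, "Peru": 180, "Argentina": 10, "Brazil": 26, "Bolivia": 11,
    "Chile": 8, "Colombia": 32, "Costa Rica": 16, "Cuba": 2,
    "Dominican Republic": 2, "Ecuador": 36, "El Salvador": 6, "Germany": 2,
    "Guatemala": 60, "Haiti": 9, "Honduras": 9, "Nicaragua": 6,
    "United States": 3,
}

# Countries that contributed a single accession each.
single_accession_countries = [
    "Australia", "Burundi", "France", "Jamaica", "Malawi", "Rwanda", "Uganda",
]
for country in single_accession_countries:
    country_counts[country] = 1

total_by_country = sum(country_counts.values())
total_by_genepool = 304 + 300  # Mesoamerican + Andean accessions

print(total_by_country, total_by_genepool)
```

Both totals should agree with the 604 genotypes reported for the study.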

DNA extraction and phaseolin evaluation

DNA extraction involved germinating 10 seeds selected at random from each accession. The seed was scarified to ensure uniform germination in the laboratory and pre-germinated in darkness on germination paper. The first trifoliate leaves of 8-day old seedlings were collected and ground in liquid nitrogen for DNA extraction with the method of Afanador et al. (1993). DNA was re-suspended in TE buffer and DNA quality was evaluated on 0.8% agarose gels followed by quantification with Hoechst H 33258 dye on a Hoefer DyNA fluorometer (DNA Quant™ 200). DNA was diluted to 10 ng/ml for further experiments. Common bean is a predominantly self-pollinating species and the genotypes in the core collection are maintained as single seed types; therefore, we did not expect to find high levels of observable heterozygosity. Phaseolin evaluations were performed as in Durán et al. (2005) with a method developed by the Genetic Resource Unit of CIAT for total seed protein.

Fluorescent microsatellite analysis

Microsatellite amplification used a fluorescent marker kit that we developed as part of this project and included microsatellites selected from those of Yu et al. (2000), Gaitán et al. (2002) and Blair et al. (2003). The selection of markers was based on the evaluation of 150 microsatellites previously tested for polymorphism and PCR conditions (Blair et al. 2006a, b). The kit included a total of nine four-color marker panels for the analysis of a total of 36 individual microsatellite markers. These microsatellites were distributed across the genome and covered 9 of the 11 chromosomes of the common bean genome according to map positions from Blair et al. (2003). One of the markers, BM188, detected two loci in separate size ranges. To sample both fast-evolving genomic sequences and slower evolving gene sequences, microsatellites were selected that had been designed and tested from non-gene coding as well as gene coding regions of the genome, termed genomic and gene-based microsatellites, respectively. The 20 genomic markers included AG, BM or GATS series markers developed by Gaitán et al. (2002) plus BMd56 from Blair et al. (2003). The 16 gene-based microsatellites used were mostly BMd series markers from Blair et al. (2003) or PV series markers from Yu et al. (2000). Forward primers for each of the microsatellites were 5′-end labeled with 6-FAM, NED, PET or TET fluorescent labels as indicated in the supplementary table and used with unlabeled reverse primers in PCR amplification. Amplification conditions were as recommended in Blair et al. (2006a) using a standard thermocycle of 95°C hot start for 3 min, followed by 28 cycles of 95°C denaturation for 40 s, 55°C annealing for 40 s and 72°C extension for 1 min. A 1-h extension at 72°C was used post-thermocycling. PCR reactions included 1× PCR buffer (10 mM of Tris–HCl pH 8.8, 50 mM of KCl, 0.1% of TritonX-100), 1.5 mM MgCl2, 0.2 mM dNTPs, 0.3 nM of each primer and 1 U of Taq polymerase (Promega).
A total of 2 μl from each of the fluorescent PCR products corresponding to the appropriate markers were mixed into multiplex panels as listed. This mixture was diluted 1:1 with dH2O and 0.5 μl of the solution was then prepared in 9 μl of formamide with 0.06 μl of Genescan LIZ500 size standard and 0.44 μl of dH2O, which was then run on an ABI PRISM 3730 fragment analysis system (Applied Biosystems) at the Cornell University Biotechnology Resource Center.

Data analysis

Allele sizes (in base pairs) were estimated by comparing the fragment peaks with the internal size standard in GeneMapper v. 3.7 software, using the default method for band calling with microsatellites (cut-off value, plus A ratio/distance, stutter ratio/distance, signal level, heterozygote balance, peak morphology and pull up/cross talk peaks), the second order least squares size calling method and the expected repeat size. Electropherograms were also reviewed visually in GeneMapper to ensure that clear peaks were found for the expected marker sizes, that stutter peaks were not called and that a proper selection of multiplex markers had been made.

Raw allele size calls were then binned to assign a whole integer allele value using the software program AlleloBin (http://www.icrisat.org/gt-bt/biometrics.htm) based on the least-squares minimization algorithm of Idury and Cardon (1997), which transforms fragment sizes based on migration into binned values with sizes in base pairs. Binned allele information was databased at http://gcpcr.grinfo.net/index.php and used for a principal coordinate analysis (PCoA) with the goal of determining sub-populations. This was based on the proportion of shared alleles estimated with the IML module of the software program Statistical Analysis Systems v. 9.1.3 (SAS Institute 1996) and on the formula $PS = \frac{1}{n}\sum_{i=1}^{n} PS_i$, where i indexes each locus, n is the total number of loci and $PS_i$ is the proportion of shared alleles at locus i. PowerMarker v. 3.25 (Liu and Muse 2005) was then used to evaluate allele frequency, number of alleles, percentage of polymorphic alleles, expected heterozygosity (He/Nei 87), observed heterozygosity (Ho) and polymorphism information content (PIC). Relationships between sub-populations were graphed in three dimensions using the software program NTSYSpc 2.1 (Rohlf 2002).
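The diversity statistics named above can be illustrated with a minimal sketch. It assumes diploid genotypes stored as pairs of binned allele sizes, counts shared alleles at a locus as the overlap of the two genotypes divided by two (one common definition, not necessarily the exact SAS/IML implementation used here), and uses the textbook formulas for gene diversity (1 minus the sum of squared allele frequencies) and PIC (Botstein-style). All function names and the allele sizes in the example are hypothetical.

```python
from collections import Counter
from itertools import combinations


def shared_at_locus(genotype_a, genotype_b):
    """Proportion of alleles shared by two diploid genotypes at one locus."""
    counts_a, counts_b = Counter(genotype_a), Counter(genotype_b)
    shared = sum(min(counts_a[allele], counts_b[allele]) for allele in counts_a)
    return shared / 2


def proportion_shared(individual_a, individual_b):
    """PS: mean of the per-locus shared-allele proportions over common loci."""
    loci = [locus for locus in individual_a if locus in individual_b]
    per_locus = [shared_at_locus(individual_a[l], individual_b[l]) for l in loci]
    return sum(per_locus) / len(loci)


def allele_frequencies(genotypes):
    """Allele frequencies at one locus from a list of diploid genotypes."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def expected_heterozygosity(freqs):
    """Gene diversity: 1 - sum of squared allele frequencies."""
    return 1 - sum(p * p for p in freqs.values())


def observed_heterozygosity(genotypes):
    """Fraction of genotypes carrying two different alleles."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)


def pic(freqs):
    """PIC: 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    ps = list(freqs.values())
    cross = sum(2 * p * p * q * q for p, q in combinations(ps, 2))
    return 1 - sum(p * p for p in ps) - cross


# Hypothetical binned allele sizes (bp) for two genotypes at two loci.
ind_a = {"BM143": (120, 120), "BM187": (200, 204)}
ind_b = {"BM143": (120, 124), "BM187": (200, 204)}
print(proportion_shared(ind_a, ind_b))

# Hypothetical locus with two homozygous genotypes carrying different alleles.
freqs = allele_frequencies([(120, 120), (124, 124)])
print(expected_heterozygosity(freqs), pic(freqs))
```

A distance for ordination can then be taken as 1 − PS, which is an assumption of this sketch rather than a detail stated in the text.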

In addition, Wright statistics and analysis of molecular variance (AMOVA) were determined with Arlequin v. 3.11 (Excoffier 2007) and clusters were analyzed for phylogenetic relationships with PowerMarker using Nei's (1973) coefficient. A consensus tree was drawn based on 1,000 bootstraps with the subprogram CONSENSE in Phylip (Felsenstein 1993). Finally, the number of populations (K) was evaluated with both STRUCTURE (Pritchard et al. 2000), which determines a Q matrix of population relatedness, and with the similar program InStruct (Gao et al. 2007), which is thought to provide a better estimate of population number in an inbreeding species such as common bean. Population structure was determined assuming an admixture model with K = 2 to K = 15 for the first of these programs, which was run locally, and from K = 2 to K = 10 for the second program, which was run remotely at http://cbsuapps.tc.cornell.edu/InStruct.aspx. A correlation model for allele frequency with a burn-in of 100,000 followed by 200,000 iterations was used for InStruct. The results of both programs were visualized with the software DISTRUCT (Rosenberg 2002).

Phenotypic data and association analysis

The following seed characteristics were measured on ten seeds of each genotype with a caliper with millimeter (mm) adjustment: seed length from end to end, and seed width and height from side to side at the hilum when the seed lies flat on a horizontal surface. In addition, the one-hundred seed weight (100 SW) was measured in grams with an analytical balance. Seed color had already been scored on all accessions according to a CIAT-developed scale based on commercial class color classifications (CIAT 1987) and photographs of the genotypes are available at http://isa.ciat.cgiar.org/urg/main.do. Phenotypic data were used in a structured association analysis with a mixed linear model, taking into account population substructure and races as determined by this study, with the software program Tassel v. 2.0 (http://www.maizegenetics.net/; Buckler et al. 2007). Significance of probability (P) values for association tests was determined with 10,000 permutations using the same software program.
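The idea behind permutation-based P values can be sketched as follows. This is not the mixed linear model implemented in Tassel and it ignores population structure; it is a simplified illustration in which the difference in mean trait value between carriers and non-carriers of one allele is compared with the same statistic after shuffling the allele labels. The function name, the seed weights and the carrier labels are hypothetical.

```python
import random


def permutation_p_value(trait, carrier, n_perm=10000, seed=0):
    """Permutation P value for a difference in trait means between allele
    carriers and non-carriers. Both groups must be non-empty."""
    rng = random.Random(seed)

    def mean_difference(labels):
        with_allele = [t for t, c in zip(trait, labels) if c]
        without = [t for t, c in zip(trait, labels) if not c]
        return abs(sum(with_allele) / len(with_allele)
                   - sum(without) / len(without))

    observed = mean_difference(carrier)
    labels = list(carrier)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any trait-marker relationship
        if mean_difference(labels) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical 100-seed weights (g) and presence of a marker allele.
seed_weight = [50.0, 52.0, 51.0, 53.0, 49.0, 50.0,
               20.0, 22.0, 21.0, 19.0, 23.0, 20.0]
has_allele = [True] * 6 + [False] * 6
print(permutation_p_value(seed_weight, has_allele))
```

Shuffling the labels keeps the number of carriers fixed, so the null distribution reflects only the assignment of alleles to phenotypes.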

Results

Allele number and genetic diversity

The total number of alleles identified in the entire study was 679 with an average of 18.4 alleles per locus with all the markers analyzed being polymorphic having from 2 to 76 alleles each (Table 1). The marker PV-at001 (confirmed through re-evaluation and visual inspection of electropherograms) showed the highest number of alleles while the next highest allele numbers were found for BM187 and BM200 with 54 and 53 alleles, respectively. Meanwhile, the next most polymorphic markers were BM137, BM143 and GATs91, and all of the above were genomic microsatellites. In contrast, the gene-based microsatellites were correspondingly less polymorphic, especially BMd51 (bi-allelic), BMd46 (tri-allelic) and BMd02, BMd17, BMd18 and PV-ag003 (each with 5 alleles). Among the least polymorphic genomic markers were GATs54 (bi-allelic), BM149 (tri-allelic) and BMd56 (4 alleles), followed by BM188A (9 alleles), AG01 (10 alleles) and BM205 (11 alleles). A total of 12 markers produced under ten alleles in the case of the gene-based markers while only 4 did so in the case of the genomic markers. Correspondingly, the average number of alleles per marker was higher for the genomic microsatellites (23.8) when compared with the gene-based microsatellites (11.3). It was notable that 12 microsatellites produced over 20 alleles each in the case of the genomic microsatellites while in the case of the gene-based microsatellites only one did. All the markers detected single loci except for one marker (BM188), which detected two loci distinguished by the pattern of stutter band amplification and size range with BM188A having alleles between 130 and 153 bp and BM188B having alleles between 160 and 200 bp. Genetic mapping has confirmed that the two loci are independent and on separate linkage groups (CIAT, unpublished results).
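The reported average can be reproduced from the totals given above, noting that it is expressed per locus rather than per marker: the 36 markers correspond to 37 loci because BM188 detects two. The short sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
total_alleles = 679
markers = 36
loci = markers + 1  # BM188 amplifies two loci (BM188A and BM188B)

mean_alleles_per_locus = total_alleles / loci
print(round(mean_alleles_per_locus, 1))
```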

Table 1 Genetic diversity values for microsatellites evaluated across the 604 genotypes of common bean

Expected heterozygosity for individual markers ranged from 0.264 to 0.952 for genomic microsatellites and from 0.007 to 0.973 for gene-based microsatellites. Observed heterozygosity was low, ranging from 0.000 to 0.297 and averaging 0.049 across all markers (0.055 for gene-based and 0.045 for genomic markers). The markers with the highest observed heterozygosity were BMd15 and PV-at003 (gene-based) and BM141 and BM156 (genomic). Observed heterozygosity could be explained by out-crossing and heterozygous individuals, by residual heterozygosity in breeding lines or by heterogeneity in the genebank accession, although most accessions were selected to be of a single seed type.

PCoA and race structure

The PCoA for the core collection genotypes is shown in Fig. 1 based on genetic similarity values from the proportion of shared alleles. The three-dimensional graphical representation allows the evaluation of population structure and geometric distances between all the genotypes in the study. The most evident subdivision was that of the two major gene pools of common bean, the Andean and the Mesoamerican, which are clearly distinguished as two separate clusters (Fig. 1a). Within each genepool, various subgroups were found, some of them corresponding to morphological races as described by Singh et al. (1991a). Five of the subgroups corresponded to divisions within the Mesoamerican genepool while four of the subgroups were found in the Andean genepool. The PCoA explained 85.7% of the variability for the germplasm set, with separation of the genepools in the first axis (64.3% of variability explained), separation of Mesoamerican races in the second axis (11.6%) and separation of the Andean races in both the third axis (8.3%) as well as the second axis. One advantage of the PCoA was that the relationships between each of the genotypes could be evaluated and this helps to visualize possible introgression between subgroups, races or genepools. The subgroups within each genepool and their relationship to morphological races are discussed separately below.
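For readers unfamiliar with the method, a principal coordinate analysis can be sketched as double-centering of the squared distance matrix followed by an eigen-decomposition. The example below is a minimal pure-Python illustration that extracts only the first axis by power iteration; it is not the SAS or NTSYSpc procedure used in this study, and it assumes Euclidean-like distances so that the leading eigenvalue is positive. The distance matrix is a hypothetical toy example (for instance, 1 − PS between three genotypes).

```python
import math


def double_center(dist):
    """Gower-centered matrix B = -0.5 * J * D^2 * J from a distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def first_axis(b, iterations=500):
    """Leading eigenpair of B by power iteration; returns (coords, eigenvalue)."""
    n = len(b)
    # Start from a non-constant vector: the constant vector is in B's null space.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [x * scale for x in v], eigenvalue


# Hypothetical distances between three genotypes lying along one gradient.
dist = [[0.0, 0.1, 0.3],
        [0.1, 0.0, 0.2],
        [0.3, 0.2, 0.0]]
b = double_center(dist)
coords, eigenvalue = first_axis(b)

# Share of total variation on this axis (trace of B equals the eigenvalue sum).
explained = eigenvalue / sum(b[i][i] for i in range(len(b)))
print(coords, explained)
```

In this toy case the three genotypes lie on a single line, so the first axis recovers their pairwise distances and accounts for essentially all of the variation.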

Fig. 1

Three-dimensional principal coordinate analysis based on the analysis of: a 604 genotypes of common bean from the core germplasm collection explaining 83.89% of the variation and showing the position of control genotypes for each genepool, b Andean genepool genotypes divided into subgroups NG1 and NG2 (race Nueva Granada) and subgroups P1 and P2 (race Peru) and c Mesoamerican genepool genotypes divided into subgroups D–J1 and D–J2 (Durango–Jalisco complex), subgroup G (race Guatemala) and subgroups M1a, M1b and M2 (race Mesoamerica)

Subgroups or clusters from the 3D graph within the Mesoamerican genepool (Fig. 1b) represented the Mesoamerica race (subgroups M1 and M2), the Durango–Jalisco complex (subgroups D–J1 and D–J2) and a third group representing the Guatemala race (subgroup G). These results agree with those of Díaz and Blair (2006) where the Durango and Jalisco races are not substantially distinct from each other but do cluster together separately from the Mesoamerica race. As in this previous study, the Mesoamerica race was made up of small-seeded genotypes of type II, III and IV growth habits, while the Durango–Jalisco complex was made up of medium-seeded genotypes mostly with type III or IV growth habits (Table 2). Similar phaseolin alleles were found distributed among both races (S, Sb, Sd and B) and it was notable that while members of the Mesoamerica race were from many countries in the primary and secondary centers of diversity, the Durango–Jalisco genotypes were predominantly from Mexico with a few additional genotypes from Guatemala.

Table 2 Distribution among subgroups of seed size, seed color, seed pattern, growth habit and phaseolin in 604 common bean genotypes

Within race Mesoamerica, the subgroups M1 and M2 were closely related. Subgroup M1 was a compact group predominantly of small-seeded, black beans and some cream and red beans mostly from Central America (Costa Rica, Guatemala, Honduras, Nicaragua, El Salvador) but also Mexico and Brazil. Growth habits ranged from type II to a few type IV and the representative genotypes in this subrace were Rio Tibagi, Mulatillo and the control genotypes DOR364 and ICA Pijao. Subgroup M2 was similar to M1 in geographic origin and in having small seed size but most of its genotypes had type III growth habit and a greater proportion were red seeded when compared with M1. One genotype in this group had a wild form of phaseolin (M) while all others within race Mesoamerica were cultivated forms (S, Sd and Sb).

The two groups identified within the Durango–Jalisco complex were similar in being predominantly of type III or IV growth habit, although D–J2 had slightly more type III genotypes than D–J1, which had more type II genotypes. The largest difference between the two groups was in seed size, with D–J1 having more medium and large-seeded genotypes than D–J2. Seed colors were similar in both subgroups, with a predominance of lighter shades (creams and yellows) in D–J1. As expected, the majority of the genotypes in both groups were from Mexico; however, in the case of D–J2 some were found outside of Mexico and in the case of D–J1, one genotype was from Africa. In addition to S type phaseolins, both subgroups contained wild type phaseolins M, M1 and M15.

The final subgroup within the Mesoamerican genepool consisted of 61 genotypes representing race Guatemala that were predominantly type IV climbing or type III semi-climbing beans and mostly from Guatemala or the Mexican state of Chiapas, which are the two places where the race was predicted to originate according to Beebe et al. (2000). It was notable that while members of the Mesoamerica race were from many countries in the primary and secondary centers of diversity and the Durango–Jalisco genotypes were predominantly from central Mexico, the Guatemala race genotypes were predominantly from this specific region. Seed types for this cluster were variable with most being medium-sized seed with black or cream seed colors. Phaseolin types included S, Sb and finally B, which is an allele that is associated with northern South American genotypes.

The Andean genepool showed four subgroups in the principal coordinate analysis, corresponding to two subgroups in race Nueva Granada (NG1 and NG2) and two subgroups in race Peru (P1 and P2). The races and subgroups were distinguished by the third axis of the PCoA (Fig. 1c), with NG2 and P1 intermediate between NG1 and P2; these last two subgroups lay at the extremes of the distribution of Andean beans and represented the most distinct genotypes of each race. In addition, for NG1, NG2 and P1 a certain amount of between-genepool introgression was evident, with genotypes that were intermediate between these Andean subgroups and the Mesoamerican genepool. Among the Nueva Granada subgroups, both clusters had medium to large-seeded genotypes of diverse color classes; however, NG1 was predominantly red or pink and mottled, while in the NG2 group cream mottled, yellow and large red or purple beans were more common. The NG1 subgroup was mostly bush beans, with the majority having determinate type I growth habit and smaller numbers of type II or type III beans with few type IV climbing beans. The NG2 subgroup meanwhile had more type III and IV growth habit climbing or semi-climbing beans. In terms of geographic origin, subgroup NG1 individuals were from a range of countries from the primary center of diversity (Argentina, Colombia, Ecuador and Peru) and from secondary centers of diversity (Brazil, Mexico and the Caribbean countries of Dominican Republic and Haiti), while subgroup NG2 individuals were mostly from the Andean region.

Meanwhile, among the race Peru subgroups, P1 contained a mix of climbing beans and bush beans of type I, II and III growth habit that were from Colombia, Ecuador and Peru, including the Andean control genotype Chaucha Chuga (G19833), while P2 was almost exclusively made up of climbing beans from Peru alone. Seed size and color in race Peru were variable, with more white, cream and yellow beans in this race when compared with Nueva Granada. The race also included all of the popping beans from our study (known as ñuñas in the Andes), most of which were clustered with subgroup P2. Inter-genepool introgression was higher for the NG2 subgroup, as evidenced by more Mesoamerican phaseolin alleles than among the NG1 or the Peru race individuals. Genotypes showing introgression between the genepools tended to have smaller seed than the other Andean genotypes and in the PCoA were located between the major genepool clusters. Given our interest in classifying the genotypes into races and subgroups, we did not analyze the genotypes showing introgression separately.

Population structure and analyses of molecular variance

Population structure analysis using InStruct (Gao et al. 2007) confirmed the basic divisions found in the PCoA. Figure 2 shows subfigures for various K values up to the number of populations detected in PCoA since each level of subdivision was instructive. For example, K = 2 showed the basic division between Andean and Mesoamerican genepool with some Andean admixture in all the Mesoamerican subgroups and some Mesoamerican admixture in all the Andean subgroups except for P2. K = 3 showed an additional division within the Mesoamerican genepool for the Durango–Jalisco group and the Mesoamerica race but with admixture occurring in subgroup D–J2. K = 4 showed the separation of Nueva Granada and Peru race genotypes but with admixture in the NG2 and P1 subgroups. At K = 5, race Guatemala appears as a separate subpopulation but with substantial admixture from race Mesoamerica and the Durango–Jalisco complex; while subgroup D–J2 has admixture from race G and race M. The evaluation at K = 9 showed some separation of subgroups although admixture was prevalent between subgroups within each individual race especially within the Andean genepool. It was notable that subgroup NG2 separated from NG1 but was associated with the first part of subgroup P1.
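Admixture of the kind described above is typically read from the membership coefficients (Q matrix) produced by STRUCTURE-type programs. The sketch below shows one simple way to summarize a row of such a matrix: assign the genotype to the subpopulation with the largest coefficient and flag it as admixed when that coefficient falls below a cut-off. The 0.8 threshold, the function name and the example coefficients are assumptions for illustration and were not used in this study.

```python
def assign_subpopulation(q_row, labels, threshold=0.8):
    """Return (label, status) for one genotype's membership coefficients.

    threshold is a hypothetical cut-off below which a genotype is
    treated as admixed rather than clearly assigned.
    """
    best = max(range(len(q_row)), key=lambda k: q_row[k])
    status = "assigned" if q_row[best] >= threshold else "admixed"
    return labels[best], status


# Hypothetical membership coefficients at K = 2.
labels = ["Andean", "Mesoamerican"]
print(assign_subpopulation([0.95, 0.05], labels))
print(assign_subpopulation([0.60, 0.40], labels))
```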

Fig. 2

Structure analysis with K = 2 to K = 5 and at K = 9 populations for the core collection of common bean showing the sub-divisions within each race. Sub-group abbreviations are D–J1 and D–J2 (Durango–Jalisco complex), G (race Guatemala) and M1 and M2 (race Mesoamerica) within the Mesoamerican genepool; and subgroups NG1 and NG2 (race Nueva Granada), and P1 and P2 (race Peru) within the Andean genepool

AMOVAs were conducted to determine the variation explained by genepools, races and subgroups as well as to compare variability in the primary versus secondary centers of diversity. Table 3 shows that more variability could be assigned to genepools and races when compared with subgroups but that substantial variability remained within each subpopulation. The comparisons between races within genepools were significant in all cases (data not shown); however, more variability was explained by the contrast of races Guatemala, Mesoamerica and the Durango–Jalisco complex (17.2%) than by the contrast of races Nueva Granada and Peru (8.9%). In addition, the comparison between the primary and secondary centers of diversity was also significant (P < 0.001) but only explained 6.6% of variability (equivalent to an FST of 0.066), with 87.6% of variability remaining among individuals within the primary and secondary centers and 5.7% within individuals. As expected for an inbreeding species such as common bean, the subpopulations showed a deficit in heterozygotes and were therefore in Hardy–Weinberg disequilibrium. The within-population fixation indices (FIS) were 0.915 to 0.939 across subpopulations but were higher in the Andean genepool (0.946) than in the Mesoamerican genepool (0.839).
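The link between a heterozygote deficit and the fixation index can be made explicit with a small sketch. It uses the standard relationship FIS = 1 − Ho/He and, as an additional assumption, the mixed-mating equilibrium F = s/(2 − s) to convert a fixation index into an implied selfing rate s. This is not the AMOVA-based estimate from Arlequin reported here; the function names and the heterozygosity values in the example are hypothetical.

```python
def fixation_index(h_obs, h_exp):
    """Within-population fixation index from observed and expected heterozygosity."""
    if h_exp == 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - h_obs / h_exp


def equilibrium_selfing_rate(f_is):
    """Selfing rate implied by F = s / (2 - s) at inbreeding equilibrium."""
    return 2.0 * f_is / (1.0 + f_is)


# Hypothetical values: low observed versus moderate expected heterozygosity.
f_is = fixation_index(0.05, 0.5)
print(f_is, equilibrium_selfing_rate(f_is))
```

Fixation indices above 0.9, as reported for these subpopulations, correspond under this assumption to selfing rates close to one.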

Table 3 Molecular analyses of variance (AMOVA) for the comparison of primary versus secondary centers of diversity and for separation of genepools, races and subgroups

Genetic differentiation and diversity parameters for subgroups

Genetic differentiation (FST) values between subgroups within genepools were generally low, ranging from 0.028 to 0.241 among the Andean subgroup comparisons and from 0.044 to 0.268 among the Mesoamerican subgroup comparisons (Table 4). Meanwhile, genetic differentiation was high when comparing subgroups between genepools. The lower genetic differentiation between the Guatemala race and D–J2 may indicate shared ancestry for these two groups; however, this subgroup also was closely related to the two Mesoamerica race subgroups. In the Andean genepool, subgroup NG2 was closely related to both Peru race subgroups. The software program PowerMarker was used to carry out a phylogenetic analysis with the neighbor-joining tree represented as part of Table 4 and showing placement of the Guatemala race relative to the other Mesoamerican and Andean races. In this dendrogram, race Guatemala is very distinct while the subgroups within each race cluster together (e.g. M1 with M2, NG1 with NG2 and P1 with P2).
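To make the scale of these values concrete, pairwise differentiation at a single locus can be sketched from allele frequencies as (HT − HS)/HT, where HS is the mean gene diversity within the two subgroups and HT is the gene diversity of their pooled frequencies. This is a simplified, equal-weight estimator in the style of Nei's GST and is not the estimator computed by Arlequin for Table 4; function names and allele frequencies are hypothetical.

```python
def gene_diversity(freqs):
    """1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def pairwise_fst(freqs_a, freqs_b):
    """Single-locus (HT - HS) / HT for two equally weighted subgroups."""
    alleles = set(freqs_a) | set(freqs_b)
    pooled = {a: 0.5 * (freqs_a.get(a, 0.0) + freqs_b.get(a, 0.0))
              for a in alleles}
    h_s = 0.5 * (gene_diversity(freqs_a) + gene_diversity(freqs_b))
    h_t = gene_diversity(pooled)
    return 0.0 if h_t == 0.0 else (h_t - h_s) / h_t


# Hypothetical allele frequencies (allele size in bp -> frequency).
same = {120: 0.5, 124: 0.5}
fixed_a = {120: 1.0}
fixed_b = {124: 1.0}
print(pairwise_fst(same, same), pairwise_fst(fixed_a, fixed_b))
```

Identical frequencies give no differentiation, while subgroups fixed for different alleles give the maximum value of one; the within-genepool values reported here lie toward the lower end of that range.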

Table 4 Genetic differentiation based on F ST values and phylogenetic analysis for subgroups identified among 604 genotypes of common bean

Genetic diversity parameters for the subgroups found within the principal correspondence analysis are shown in Table 5. The gene diversity values (He) were slightly higher for the Mesoamerican genepool subgroups than for the Andean genepool subgroups; within the Andean subgroups, NG2 and P1 were the most diverse, and within the Mesoamerican group, subgroup D–J2 and race G were the most diverse. Diversity values were not correlated with the number of genotypes in the subgroup, with D–J2 having fewer genotypes but the highest diversity values. Observed heterozygosity within the subgroups averaged 0.049, being very low for M1, M2, NG1, NG2, P1 and P2 (all below 0.050) and slightly higher for subgroups D–J1 (0.058) and D–J2 (0.0102) and for race G (0.096), which may result from the climbing beans prevalent in these last three subgroups, as climbing beans often have higher out-crossing rates (this laboratory, unpublished data). Upon comparing the allele number and expected heterozygosity values for each microsatellite in each subgroup and genepool, we found that genomic microsatellites tended to be more variable in Andean subgroups while gene-based microsatellites were more variable in Mesoamerican subgroups, although this was partly related to the number of genotypes in each subgroup. For example, the subgroup NG1 was low in number of genotypes and in allele number. Among the genomic microsatellites, some markers were more powerful than others in defined subgroups (e.g. BM141, BM156 and BM187 for Andean beans and BM175 and BM200 for Mesoamerican beans) while other markers functioned equally across all subgroups and both genepools (e.g. BM137 and BM143). The same can be said for specific gene-based markers: for example, BMd01 was more useful in the Andean genepool than in the Mesoamerican genepool, while PV-at001 was more useful across all subgroups.
Another interesting observation was that phaseolin was approximately as diverse in number of alleles/patterns as the average for all of the microsatellites considering both genomic and gene-based markers for each of the subgroups, the one exception being higher average number of phaseolin alleles than microsatellite alleles in the D–J2 group. Expected heterozygosity was also higher for the phaseolin locus than for the average of the microsatellite loci in each subgroup.
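
The diversity parameters discussed here (gene diversity He, observed heterozygosity and the resulting fixation index) can be illustrated for a single locus. The sketch below is a minimal example under stated assumptions: the allele sizes are hypothetical, the function name is illustrative, and the simple estimate He = 1 − Σp² is used without the small-sample correction that some software applies.

```python
from collections import Counter


def diversity(genotypes):
    """Return (He, Ho) for one locus from a list of diploid genotypes (allele pairs)."""
    allele_counts = Counter(a for g in genotypes for a in g)
    n_alleles = sum(allele_counts.values())
    # Gene diversity (expected heterozygosity): 1 minus the sum of squared allele frequencies
    he = 1.0 - sum((c / n_alleles) ** 2 for c in allele_counts.values())
    # Observed heterozygosity: fraction of individuals carrying two different alleles
    ho = sum(1 for a, b in genotypes if a != b) / len(genotypes)
    return he, ho


# Hypothetical allele sizes (bp) for ten accessions, nine of them homozygous
genotypes = [(150, 150)] * 4 + [(152, 152)] * 3 + [(154, 154)] * 2 + [(150, 152)]
he, ho = diversity(genotypes)
f_is = 1.0 - ho / he  # heterozygote deficit relative to Hardy-Weinberg expectation
print(f"He = {he:.3f}, Ho = {ho:.3f}, F_IS = {f_is:.3f}")
```

With mostly homozygous individuals, as expected for a selfing crop, Ho falls far below He and the fixation index approaches 1.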

Table 5 Overall observed heterozygosity per subgroup and number of alleles and expected heterozygosity found for genomic and gene-based microsatellite markers and for phaseolin patterns across each subgroup found within the collection of 604 common bean genotypes

Association of seed phenotypes and marker alleles

Association of marker loci with seed characteristics was evaluated across the genome with the justification that phaseolin and other regions of the genome have been associated with seed size in several previous quantitative trait loci (QTL) studies. Our interest was in testing associations of both the phaseolin locus (Phs) and microsatellite loci with seed characteristics to test whether markers on linkage group b07, where Phs is located, were more likely to be associated than markers on other linkage groups and to confirm associations identified in previous studies of QTL for these characteristics. To control for population structure, we performed association analyses independently for each race. As seen in Table 6, the most highly significant (P < 0.001) associations found were BMd8 (linkage group b04) for seed weight in races of both genepools (Mesoamerica, Nueva Granada and Peru); BM183 (b07) for seed weight, width and length also in both genepools; BM200 (b01) for seed weight in the Durango–Jalisco race; BM156 (linkage group b02), AG01 (b03), BMd16 (b04), Phs and BM205 (b07) for seed characteristics in races of the Mesoamerican genepool; and BM143 and PV-cct001 (b02), PV-at001 (b04) and BM187 (b06) for seed characteristics in races of the Andean genepool. It was notable that Durango–Jalisco presented only one highly significant association while race Guatemala had associations only at P < 0.05, but not at the more significant level.

Table 6 Associations between microsatellite or phaseolin alleles and seed size traits in common bean races according to population structure analysis

The average seed weight and size were smallest in the Mesoamerica race, intermediate in the Durango–Jalisco complex, and highest in the Andean races, as shown at the bottom of Table 6. Interestingly, while accessions from race Peru had heavier seeds on average, these seeds were shorter on average than those of race Nueva Granada accessions. The difference in weight, therefore, was due to the wider and higher seed dimensions found in the race Peru accessions, which typically were rounded, while Nueva Granada accessions were elongated or kidney shaped. To statistically evaluate the differences in variability for seed characteristics among the races, Levene’s test was used to determine whether variances were equal in these different subpopulations. Variances were unequal (P < 0.0001) for seed weight, length and height but were equal for seed width (P < 0.05). The most diversity for seed length was found in races Nueva Granada and Peru, while seed weight was very diverse for race Peru. Mesoamerican races had less diversity for seed weight, height and length than Andean races, as reflected in the lower standard deviations.
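
The logic of Levene’s test can be sketched with the standard library alone. The example below computes only the W statistic, using absolute deviations from each group mean; whether the analysis centered on means or medians is not stated in the text, so this choice is an assumption, as are the function name and the seed weights, which are hypothetical values for illustration. In practice W is compared with an F distribution with (k − 1, N − k) degrees of freedom to obtain a P value.

```python
from statistics import mean


def levene_w(*groups):
    """Levene's W statistic based on absolute deviations from each group mean."""
    # Transform each observation into its absolute deviation from the group mean
    z = []
    for g in groups:
        center = mean(g)
        z.append([abs(y - center) for y in g])
    n_total = sum(len(zg) for zg in z)
    k = len(z)
    z_means = [mean(zg) for zg in z]
    z_grand = sum(sum(zg) for zg in z) / n_total
    # One-way ANOVA on the deviations: between-group versus within-group spread
    between = sum(len(zg) * (m - z_grand) ** 2 for zg, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zg, m in zip(z, z_means) for v in zg)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical 100-seed weights (g) for two races
mesoamerica = [18, 20, 22, 21, 19]
peru = [35, 50, 62, 44, 70]
w = levene_w(mesoamerica, peru)
print(f"W = {w:.2f}")
```

A large W indicates that the groups differ in spread rather than in mean, which is the kind of contrast described for seed weight between the Mesoamerican and Andean races.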

Discussion

This study describes one of the first large-scale genotyping exercises for common bean using a core collection and fluorescent microsatellite marker technology. Core collections are considered valuable ways of analyzing phenotypic diversity (Brown 1989) but generally have not been genotyped before due to a lack of high throughput marker technologies. Therefore, in this project our initial step was to develop the fluorescent marker set needed for large-scale genotyping. Before starting this project, fluorescent microsatellites had not been widely used for common bean as only one publication had attempted this technology for this crop (Masi et al. 2002). We decided to use fluorescent microsatellites and automated detection on a capillary system since this technology allows many samples to be evaluated in a uniform manner and gives more accurate allele calls than gel-based readings of microsatellites.

Fluorescent microsatellite evaluation uncovered a larger number of alleles than in our previous studies of common bean genetic diversity with microsatellites (Gomez et al. 2004; Blair et al. 2007; Díaz and Blair 2006; Zhang et al. 2008); however, a direct comparison is difficult due to the different number and source of genotypes included in each study. Allele richness was also higher than in the previous studies, with a Nei’s index for the entire set of genotypes of 0.643, which may reflect the large number of unique alleles. These values were even higher for the genomic (non-coding) microsatellites analyzed alone (0.734) when compared with the gene-based microsatellites analyzed alone (0.523), which agrees with our previous study comparing a large number of silver-stained markers from both of these categories (Blair et al. 2006a). It was notable that the AT motif-based marker was more polymorphic than GA- and GAT-based microsatellites and that overall fluorescent marker technology was highly accurate in calling alleles with small size differences (Diwan and Cregan 1997).

For the diversity assessment carried out in the PCoA, the results from fluorescent microsatellite evaluation agreed well with previous evaluations of common bean diversity with smaller sets of germplasm and non-fluorescent markers (Blair et al. 2006a, 2007; Díaz and Blair 2006). In turn, these results augment the analysis done by Beebe et al. (2001, 2002) with anonymous, dominant RAPD markers used to characterize the Mesoamerican section of this same core collection. We found the proportion of shared alleles to be a useful measure of genetic similarity, which was preferable to the Dice similarity coefficient for a large dataset such as ours where non-amplification occurred for 4.1% of the total genotype × marker combinations. The reproducibility of fluorescent microsatellites as precisely read, single copy, co-dominant markers is advantageous over other types of markers, but some non-amplification occurred, perhaps due to the small reaction volumes used in PCR or to the threshold sensitivity of the detection system.
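
One way the proportion of shared alleles can accommodate non-amplification is by averaging only over loci scored in both individuals. The sketch below is a minimal illustration of that idea, not the exact routine of the software used here: the function name is illustrative, the marker names are taken from this study, and the allele sizes and missing-data pattern are hypothetical.

```python
from collections import Counter


def shared_allele_proportion(ind_a, ind_b):
    """Mean proportion of shared alleles over loci scored in both diploid individuals."""
    per_locus = []
    for locus, genotype_a in ind_a.items():
        genotype_b = ind_b.get(locus)
        if genotype_a is None or genotype_b is None:
            continue  # skip loci with non-amplification in either individual
        counts_a, counts_b = Counter(genotype_a), Counter(genotype_b)
        # Count alleles in common, respecting copy number (0, 1 or 2 per locus)
        shared = sum(min(counts_a[a], counts_b[a]) for a in counts_a)
        per_locus.append(shared / 2)
    if not per_locus:
        raise ValueError("no locus scored in both individuals")
    return sum(per_locus) / len(per_locus)


# Hypothetical allele sizes (bp); None marks a non-amplified genotype x marker combination
acc_1 = {"BM141": (150, 150), "BM156": (200, 202), "BM187": None}
acc_2 = {"BM141": (150, 152), "BM156": (200, 202), "BM187": (180, 180)}
similarity = shared_allele_proportion(acc_1, acc_2)
print(similarity)  # 0.75: one of two alleles shared at BM141, both at BM156
```

Because missing loci are dropped pairwise, a failed reaction reduces the number of loci compared but is not counted as a difference between the two accessions.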

As in our previous studies, using gel-based assays of microsatellites (Blair et al. 2007; Díaz and Blair 2006), race structure of common bean was doubly dichotomous with two large groups in each of two genepools, namely the Durango–Jalisco complex and the Mesoamerica race in the Mesoamerican genepool and the Nueva Granada and Peru races in the Andean genepool. However, in addition, approximately 10% of the genotypes clustered into a very distinct group of climbing beans all from Guatemala or southern Mexico that we defined as race Guatemala. These results confirm the existence of race Guatemala as an additional race within the Mesoamerican genepool validating the results with RAPD markers from Beebe et al. (2000) and expanding on the microsatellite analysis from Díaz and Blair (2006). Furthermore, these results are useful in better defining race Guatemala, since the initial differentiation of the race by Beebe et al. (2000) was based on a small number of genotypes (12) while here we identified 61 race Guatemala genotypes. Meanwhile, race Guatemala may have been difficult to detect in RFLP or RAPD analyses such as those by Becerra-Velazquez and Gepts (1994) or Johns et al. (1997) because it is less frequent in random germplasm collections and therefore may not have been sampled in these previous studies. Within the Andean genepool, the Chile race could not be identified here, perhaps, due to reduced sampling of germplasm from the southern Andes of South America or due to the fact that this race is not highly distinct from race Peru as was found in our previous study of a smaller group of genotypes (Blair et al. 2007). Specific studies with Chilean genotypes are needed to define this race further.

Another observation was that secondary center germplasm was found to represent part of the diversity found in the primary centers. As a result, genetic diversity in this study was higher in the primary centers than in the secondary centers of diversity. This could be due, perhaps, to the limited gene flow out of the primary center for some of the races or to more limited sampling of secondary center diversity. The races that appear to be most extensive outside the center of diversity are races Mesoamerica and Nueva Granada, which agrees with observations made by Singh et al. (1991b). Further analysis and integration of SSR datasets from this study and from previous evaluations of dry beans with primary center germplasm or race standards (Díaz and Blair 2006; Blair et al. 2007) or with secondary center germplasm from Central America (Gomez et al. 2004), China (Zhang et al. 2008) or Southern Europe (Lioi et al. 2005; Maras et al. 2006; Sicard et al. 2005) will be useful for an even better understanding of common bean race structure and genetic diversity. Another observation from our study was that the genepools had equivalent amounts of SSR diversity, while in a previous study we predicted more diversity in the Andean genepool than in the Mesoamerican genepool (Blair et al. 2006a).

This study also helped validate subgrouping within each of the races, which can be considered as subraces. For example, the subgroups within the Mesoamerica race largely corresponded to the subraces observed by Beebe et al. (2000), who predicted the existence of groups M1 and M2; however, the subdivisions of the Durango–Jalisco group did not correspond directly to the subgroups identified by Beebe et al. (2000) but did correspond well to those of Díaz and Blair (2006). Meanwhile, we did not identify any subdivisions within race Guatemala, in contrast to the results of Beebe et al. (2000), even though we included four of the same genotypes as this previous study, with two from each of the subdivisions identified by them. Subgrouping within the Andean genepool agreed with that of Blair et al. (2007) but much greater admixture was found among the groups from races Nueva Granada and Peru. Growth habit was found not to be a defining characteristic of each of the races or subraces. As was found in Blair et al. (2007), genotypes with climbing growth habit were found within the Nueva Granada subgroups and genotypes with bush bean growth habit were found within the Peru race subgroups, contrasting with the original race morphological description from Singh et al. (1991b).

Meanwhile, phaseolin pattern, which was useful for genepool identification, was found to be only partially predictive of race and subrace/subgroup divisions, and mostly in the Andean genepool, where a diverse set of phaseolin patterns was observed, although phaseolin ‘T’ and then to a lesser extent ‘H’ were most common. Interestingly, the proportion of these two patterns varied in each of the races, with phaseolin ‘T’ more prevalent in race Nueva Granada (74.8%) than in race Peru (60.2%) and ‘H’ higher in race Peru (18.8%) than in race Nueva Granada (6.9%). Meanwhile, in the Mesoamerican genepool, all the races and subgroups had the same phaseolin alleles (S, Sb, Sd and B), so microsatellite allele frequency would be a better method of separating races within this genepool.

The finding of similar levels of diversity in both genepools but varying diversity within subgroups contrasts with our previous studies with non-fluorescently labeled microsatellite markers (Blair et al. 2006a, 2007; Díaz and Blair 2006) where Andean diversity was higher than Mesoamerican diversity and where diversity was equivalent in most subgroups. This may have been due to the evaluation of more diverse genotypes from race Guatemala and the Durango–Jalisco complex than in these previous studies. The similarity in genetic diversity between genepools could also be explained by the discovery of higher diversity for genomic microsatellites in Andean subgroups and, in contrast, the higher diversity of cDNA-based microsatellites in Mesoamerican subgroups. In our previous analysis with microsatellite markers, we used fewer gene-based markers than genomic markers (Blair et al. 2006a), so this may also explain the difference between studies. Higher diversity within the Andean genepool for genomic microsatellites may also be due to ascertainment bias since an Andean genotype was used to develop the genomic markers (Gaitán et al. 2002). However, overall the results indicate the value of using microsatellites from several sources to evaluate genetic diversity in the crop and by extension the utility of both types of markers for association analysis, as discussed below.

Association analysis for microsatellite and phaseolin alleles with seed size or shape was performed to compare the effectiveness of association mapping with previous reports of linkage mapping and QTL analysis for these traits in bi-parental populations. Some of the most highly significant marker × seed weight associations (P < 0.0001) agreed with previously reported QTL on linkage groups b02, b03, b07 and b11 in wild × cultivated backcross and recombinant inbred line (RIL) populations (Blair et al. 2006b; Koinange et al. 1996) as well as in inter- and intra-genepool RIL populations (Park et al. 2000; Tar’an et al. 2002). The phaseolin locus, Phs, was found to be associated with seed weight in the Mesoamerica race, agreeing with results from the first of these studies, and appears to be especially important when considering wild bean introgression. Additional associations at P < 0.01 were found for phaseolin alleles and seed width in race Durango, for seed width or length in race Nueva Granada and for seed length in race Peru but none were found for race Guatemala. The association of BM143 and several other markers on linkage group b02 with seed weight agrees with results for a seed size QTL from Blair et al. (2006b) and may highlight this region of the genome as important in determining this characteristic. We decided to evaluate seed size characteristics because these are among the most important traits of common bean and a major determinant of commercial acceptability of traditional or improved varieties (Park et al. 2000; Voysest et al. 1994) as well as being highly heritable (Motto et al. 1978).
Furthermore, bean seed weight has also been correlated with yield potential and with various yield components (Terán and Singh 2002) and varieties of dry beans that do not meet minimum seed weight criteria can be rejected by the marketplace while in snap beans, varieties with elongated seed shape or smaller seed size are preferred as they are correlated with pod quality (Blair et al. 2006b).

This study found that within each molecularly defined race of dry bean, there was variability for seed size and shape and that the range between the smallest and largest cultivated varieties for seed weight within the Mesoamerica race represented part of the sixfold difference found for all cultivated beans. When comparing wild and cultivated beans, this difference in seed weight is amplified to over 20–25-fold (Acosta-Gallegos et al. 2007). Furthermore, seed size can influence and be influenced by the concentration of macronutrients such as protein, phytate and starch; and the proportion of seed coat to endosperm is directly proportional in most cases to seed size and shape, with resulting influences on tannin levels and mineral concentration, since these are differentially found in seed coat tissues (Ariza-Nieto et al. 2007). Therefore, the markers we identified as being associated with seed weight, length, width and height can be important for the understanding or marker-assisted selection of these traits, especially in validating results from QTL studies, which only use a few genotypes as parents. In addition, the comprehensive germplasm set that was evaluated here and that included all races of the crop could be useful for discovering signatures of domestication bottlenecks for seed size traits, where seed weight was a major characteristic emphasized by human selection (Papa and Gepts 2003).

In conclusion, the results of this large-scale genotyping of common bean using fluorescent microsatellite markers will be useful as a baseline for further genotypic characterization of this and other collections, especially since this is the first integrated analysis of genotypes from primary and secondary centers of diversity and from both major gene pools of common bean using SSR markers. Overall, race structure was similar to previous studies but the current analysis was more definitive due to the larger number of genotypes analyzed and highlighted the importance of races Nueva Granada and Peru in the Andean genepool and Mesoamerica, Durango–Jalisco and Guatemala in the Mesoamerican genepool. The core collection analyzed here is unique in representing all major agro-ecologies where common bean is grown, based on a geographic information system analysis that took into account climatic parameters and origin, and the microsatellites we used cover most of the genome. As such, these results can be used to complement diversity assessments of national collections by providing the global population structure and whole-genome genotype in which to integrate specific collections and markers. In addition, a selection of the genotypes from the core collection with adaptation to tropical environments and lack of photoperiod sensitivity has been constituted into a reference collection that will be used to assay diversity at individual loci and for further phenotyping and association mapping.