Introduction

Maize is one of the most important crops in the world, including in China. Identifying the genetic regions responsible for agronomically important traits is of fundamental significance for maize improvement. Over the past several decades, linkage mapping has been used extensively for the genetic dissection of simple and complex traits in maize. Recently, association mapping, which has several advantages over traditional linkage analysis (Kraakman et al. 2004; Flint-Garcia et al. 2005), has proved to be an effective approach for connecting structural genomics and phenomics (Thornsberry et al. 2001).

Association mapping has been used extensively in human genetics (Corder et al. 1994; Templeton 1995). It was not introduced into plant genetics until 2001 (Thornsberry et al. 2001), mainly because little information was available on population structure and linkage disequilibrium (LD) patterns in plants (Flint-Garcia et al. 2003). Because many important crops have a long and complex domestication and breeding history, and gene flow is limited in most wild plants, many crop populations are highly structured (Sharbel et al. 2000; Flint-Garcia et al. 2003). When association analysis is performed on such populations without accounting for population structure, spurious associations between genotype and phenotypic variation may be detected because allele frequencies are distributed unequally among subgroups (Knowler et al. 1988). This has also been verified in maize (Andersen et al. 2005; Camus-Kulandaivelu et al. 2006; Wilson et al. 2004). Recently, with advances in statistical methods, independent markers distributed throughout the genome have been used successfully to detect population structure (Pritchard et al. 2000a, b). The resolution of association mapping depends on the extent and distribution of LD across the genome within a given population (Remington et al. 2001). LD generally depends on the history of the population, but other factors such as population structure, selection, mutation, relatedness, and genetic drift can also cause LD. Among all of these factors, however, LD caused by linkage is of the greatest importance for association mapping (Stich et al. 2005).

In our previous study, a mini core set of maize inbred lines (94 accessions) was defined to represent the genetic diversity of Chinese maize inbred lines (Li et al. 2004; Yu et al. 2007). Together with B73, a total of 95 inbred lines were used as the association mapping population for further research. However, little is known so far about the genetic diversity, population structure, and LD of this association mapping panel.

The objectives of our research were to (1) assess the genetic diversity of our association mapping population; (2) investigate the population structure among the inbred lines; and (3) determine the extent and distribution of LD between pairs of SSR loci.

Materials and methods

Plant materials

In our previous study, 288 maize inbred lines, comprising the 242 inbred lines of the core collection established in China (Li et al. 2004) and some elite lines used in breeding programs in recent years, were genotyped for genetic diversity at 49 simple sequence repeat (SSR) loci. These markers, which are publicly available (http://www.maizegdb.org), covered the maize genome. Based on SSR fingerprinting, a mini core set of 94 inbred lines representing 87% of the SSR allelic diversity of the 288 inbred lines was defined (Yu et al. 2007). These lines are genetically diverse but mainly of Chinese origin. Together with B73, they constituted the association mapping population used in the present study. The 95 inbred lines and their pedigrees or sources are listed in Table 1.

Table 1 List of the 95 inbred lines used in this study

SSR genotyping

Genomic DNA was extracted from the leaves of 1-month-old maize seedlings according to the CTAB procedure (Saghai-Maroof et al. 1984). A total of 145 SSR loci, randomly distributed across the genome, were used to genotype the mini core set of inbred lines. Among them, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide SSRs accounted for 30.34, 37.24, 24.13, 5.52 and 2.76%, respectively. Most of the SSR repeat motifs and sequences were obtained from MaizeGDB (http://www.maizegdb.org).

The TP-M13 method (Schuelke 2000) was used for genotyping. It involves three primers: a sequence-specific forward primer, a sequence-specific reverse primer with an M13 tail (5′-CACGACGTTGTAAAACGAC-3′) at the 5′ end, and a universal fluorescently labeled M13 primer. The fluorescent dyes used were FAM, VIC, PET and NED. PCR products were size-separated on an ABI Prism 3700 DNA Sequencer (Perkin Elmer Biotechnologies, Foster City, CA, USA) and were classified into specific alleles with GeneTyper 2.1 software (Perkin Elmer/Applied Biosystems).

Genetic diversity analysis

PowerMarker V3.25 (Liu and Muse 2005) was used to calculate summary statistics including allele number, allele frequency, gene diversity (or expected heterozygosity) and PIC. In addition, line-specific alleles and rare alleles (with frequency <5%) were identified. Gene diversity (D) was defined as the probability that two alleles chosen at random from the population are different, and was calculated at each locus as:

$$ \hat{D}_{l} = \left(1 - \sum\limits_{u = 1}^{k} \tilde{p}_{lu}^{2}\right) \Big/ \left(1 - \frac{1 + f}{n}\right) $$

where $\tilde{p}_{lu}$ is the frequency of the uth allele at the lth locus, f is the inbreeding coefficient, and n is the sample size. Polymorphism information content (PIC) (Botstein et al. 1980) was estimated as:

$$ {\widehat{ PIC}}_{l} = 1 - \sum\limits_{u = 1}^{k }{\tilde{p}_{lu}^{2} } - \sum\limits_{u = 1}^{k - 1} {\sum\limits_{v = u + 1}^{k} {2\tilde{p}_{lu}^{2} \tilde{p}_{lv}^{2} } } $$

where $\tilde{p}_{lu}$ and $\tilde{p}_{lv}$ are the frequencies of the uth and vth alleles, respectively.
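The two statistics can be illustrated with a minimal Python sketch. This is not PowerMarker's implementation; the function names and the toy allele calls are hypothetical. The sketch assumes that the small-sample correction for gene diversity takes the usual unbiased form, dividing by 1 − (1 + f)/n, and that f = 1 for fully inbred lines.

```python
from collections import Counter


def allele_frequencies(calls):
    """Allele frequencies at one locus; missing calls (None) are ignored."""
    observed = [a for a in calls if a is not None]
    counts = Counter(observed)
    return {allele: c / len(observed) for allele, c in counts.items()}


def gene_diversity(freqs, n, f=1.0):
    """Gene diversity with a small-sample correction.

    Assumption: the correction divides by 1 - (1 + f) / n, with f = 1
    for fully inbred lines.
    """
    heterozygosity = 1.0 - sum(p * p for p in freqs.values())
    return heterozygosity / (1.0 - (1.0 + f) / n)


def pic(freqs):
    """Polymorphism information content (Botstein et al. 1980)."""
    p = list(freqs.values())
    homozygosity = sum(x * x for x in p)
    cross = sum(
        2.0 * p[u] ** 2 * p[v] ** 2
        for u in range(len(p))
        for v in range(u + 1, len(p))
    )
    return 1.0 - homozygosity - cross


# Hypothetical allele calls (fragment sizes in bp) at one SSR locus in six lines.
calls = [120, 120, 124, 124, 124, None]
freqs = allele_frequencies(calls)
print(gene_diversity(freqs, n=5), pic(freqs))
```

Because PIC subtracts an additional non-negative term from 1 − Σp², it can never exceed the uncorrected gene diversity at the same locus, which agrees with the lower average PIC reported in the Results.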

In addition, the relationship between the total number of alleles and the sample size was investigated by re-sampling. For each sample size, re-sampling was repeated 1,000 times and the results were averaged.
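The re-sampling procedure can be sketched as follows. The data layout (one list of allele calls per inbred line, one entry per locus) and the function name are assumptions made for illustration, and sampling without replacement is assumed because the text does not specify it.

```python
import random


def mean_allele_count(genotypes, sample_size, n_reps=1000, seed=0):
    """Average total number of distinct alleles over random subsets of lines.

    genotypes: list of lines, each a sequence of allele calls (one per locus).
    Missing calls are encoded as None and are not counted as alleles.
    """
    rng = random.Random(seed)
    n_loci = len(genotypes[0])
    total = 0
    for _ in range(n_reps):
        sample = rng.sample(genotypes, sample_size)  # without replacement
        for locus in range(n_loci):
            alleles = {line[locus] for line in sample if line[locus] is not None}
            total += len(alleles)
    return total / n_reps


# Hypothetical toy panel: four lines genotyped at two loci.
toy = [["a", "x"], ["a", "y"], ["b", "y"], ["b", "z"]]
curve = {size: mean_allele_count(toy, size) for size in range(1, len(toy) + 1)}
print(curve)
```

Evaluating the function for every sample size from 2 up to the full panel gives the kind of accumulation curve described here. The same routine, applied with the sample size fixed at the size of the smallest subgroup, would also serve for comparing subgroups of unequal size.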

Population structure analysis

To evaluate the population structure of the association mapping population, the software package STRUCTURE 2.1 (Pritchard et al. 2000a, b) was used to subdivide the inbred lines into genetic subgroups. The number of subgroups (K) was set from 2 to 10, and three independent runs were performed for each K. The burn-in length and the number of iterations were both set to 500,000. Lines with membership probabilities ≥0.8 were assigned to the corresponding subgroups and lines with membership probabilities <0.8 were assigned to a “mixed” subgroup.
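The assignment rule can be written as a short Python sketch. The function name and the dictionary layout of the membership probabilities are hypothetical and are not part of STRUCTURE's output format.

```python
def assign_subgroup(memberships, threshold=0.8):
    """Return the subgroup whose membership probability reaches the threshold,
    or "mixed" when no subgroup does.

    memberships: dict mapping subgroup name to membership probability.
    """
    name, q = max(memberships.items(), key=lambda item: item[1])
    return name if q >= threshold else "mixed"


# Membership values reported for C103 in this study (the two largest components).
print(assign_subgroup({"TSPT": 0.529, "Lancaster": 0.465}))
```

With the 0.8 threshold, a line such as C103, whose ancestry is split almost evenly between two subgroups, falls into the mixed subgroup.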

Allele number, gene diversity and PIC of each subgroup, as well as subgroup-specific alleles, were calculated. A re-sampling strategy was also used to compare genetic diversity among subgroups of different sizes: from each larger subgroup, a random sample equal in size to the smallest subgroup was drawn. The procedure was repeated 1,000 times and the results were averaged.

Linkage disequilibrium estimation

A permutation version of Fisher’s exact test in PowerMarker V3.25 (Liu and Muse 2005) was used to evaluate the extent of LD (r²) between SSR pairs at the 0.01 significance level in the entire population and in each subgroup. Because sample size affects the statistical power of the LD test, a re-sampling strategy was used to obtain comparable estimates: random samples of the same size as each subgroup were drawn from the entire set of lines, and the expected percentage of SSR pairs in significant LD was calculated. The procedure was repeated 50 times and the resulting estimates were averaged.
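The idea behind a permutation version of an exact test for two multi-allelic loci can be sketched in Python. This is a simplified stand-in and not PowerMarker's implementation: it assumes fully homozygous lines with no missing data, and all function names are hypothetical. Because the marginal allele counts are fixed under permutation, the probability of a two-locus table depends only on the sum of log(cell!), which is used as the test statistic.

```python
import math
import random
from collections import Counter
from itertools import combinations


def table_stat(a, b):
    """Sum of log(cell!) over the two-locus table; with fixed margins, a larger
    value corresponds to a less probable table under independence."""
    cells = Counter(zip(a, b))
    return sum(math.lgamma(c + 1) for c in cells.values())


def permutation_exact_test(a, b, n_perm=10000, seed=0):
    """Monte Carlo P value: share of permuted tables at most as probable as
    the observed one (the observed table is counted once)."""
    rng = random.Random(seed)
    observed = table_stat(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if table_stat(a, shuffled) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def fraction_in_ld(lines, alpha=0.01, n_perm=1000, seed=0):
    """Fraction of locus pairs whose permutation P value is below alpha."""
    n_loci = len(lines[0])
    columns = [[line[j] for line in lines] for j in range(n_loci)]
    pairs = list(combinations(range(n_loci), 2))
    significant = sum(
        permutation_exact_test(columns[i], columns[j], n_perm, seed) < alpha
        for i, j in pairs
    )
    return significant / len(pairs)


# Hypothetical panel of 20 lines: loci 0 and 1 are perfectly associated,
# locus 2 is independent of both.
locus0 = ["x"] * 10 + ["y"] * 10
locus1 = ["1"] * 10 + ["2"] * 10
locus2 = ["p", "q"] * 10
panel = list(zip(locus0, locus1, locus2))
print(fraction_in_ld(panel, alpha=0.01, n_perm=500))
```

The expected proportion for a subgroup of size m would then be approximated by calling `fraction_in_ld` on `random.sample(panel, m)` repeatedly (50 times in this study) and averaging the results.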

In addition, the ratio of the percentage of linked SSR pairs in significant LD to the percentage of unlinked SSR pairs in significant LD was calculated. SSR loci located on the same chromosome were defined as linked, and SSR loci located on different chromosomes were defined as unlinked.
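This classification and ratio can be sketched as follows. The marker names, the chromosome assignments and the set of significant pairs are hypothetical inputs, and the function name is not taken from any software used in the study.

```python
from itertools import combinations


def linked_unlinked_ld(chromosome, significant_pairs):
    """Percentage of linked and unlinked locus pairs in significant LD, and their ratio.

    chromosome: dict mapping locus name to chromosome number.
    significant_pairs: set of frozensets, each holding two locus names in LD.
    """
    counts = {"linked": [0, 0], "unlinked": [0, 0]}  # [significant, total]
    for a, b in combinations(sorted(chromosome), 2):
        kind = "linked" if chromosome[a] == chromosome[b] else "unlinked"
        counts[kind][1] += 1
        if frozenset((a, b)) in significant_pairs:
            counts[kind][0] += 1
    pct = {
        kind: (100.0 * sig / total if total else float("nan"))
        for kind, (sig, total) in counts.items()
    }
    ratio = pct["linked"] / pct["unlinked"] if pct["unlinked"] else float("inf")
    return pct["linked"], pct["unlinked"], ratio


# Hypothetical example: four markers on two chromosomes.
chrom = {"m1": 1, "m2": 1, "m3": 2, "m4": 2}
in_ld = {frozenset(("m1", "m2")), frozenset(("m1", "m3"))}
print(linked_unlinked_ld(chrom, in_ld))
```

A ratio above 1 indicates that pairs on the same chromosome are in significant LD more often than pairs on different chromosomes.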

Results

Genetic diversity

A total of 145 SSR loci, randomly distributed across the genome, were used to evaluate the genetic diversity of the mini core set of inbred lines. All of the 145 SSR loci were polymorphic across the 95 inbred lines and a total of 1,365 alleles were detected (Table 2 and Supplementary material). The average number of alleles per locus was 9.4, ranging from 2 to 38. The average genetic diversity was 0.6831 with a range of 0.2921–0.9489. In addition, the average PIC value was 0.6439 with a range of 0.2555–0.9465.

Table 2 Summary statistics for the 145 SSR loci used in the present study

Of the 145 SSR loci, 44 were dinucleotide SSRs and the others were longer-repeat SSRs. The results showed that the allele number, the gene diversity, and the PIC were not equal among different types of SSR loci (Table 3). Dinucleotide SSRs had more alleles, higher gene diversity, and higher PIC than other longer-repeat SSR loci.

Table 3 Summary statistics for different types of SSR loci

Among the 1,365 alleles, 320 private alleles (23.44%) were found in only one of the 95 inbred lines. Frequencies of most alleles were low, and rare alleles with a frequency of less than 5% accounted for 55.75% (Fig. 1). To clarify the relationship between allele number and sample size, a re-sampling strategy was used to draw samples of different sizes (2–95) from the 95 inbred lines, 1,000 times for each size. The allele numbers from the 1,000 replicates for a given sample size were then averaged. The results showed that 65 random samples could capture 90% of the total alleles in the entire sample (Fig. 2).

Fig. 1 Distribution of allele frequencies for the 1,365 alleles detected in the study

Fig. 2 Plot of the expected number of alleles in samples of different sizes

Population structure

To understand the genetic structure of the association mapping population, the model-based approach implemented in STRUCTURE was used to assign each inbred line to a subgroup. STRUCTURE was run with the number of subgroups K fixed at values from 2 to 10, and three runs were performed for each K. Because STRUCTURE tends to overestimate the number of subgroups for inbred lines (Pritchard and Wen 2004), it was difficult to choose the “correct” K from the Ln probability of data, Ln P(D) (Fig. 3). The results for each K were therefore compared with the known pedigrees of the inbred lines. When K = 4, the model-based subgroups were largely consistent with the known pedigrees. The four subgroups corresponded to the four major germplasm origins in China, i.e., Lancaster, Reid, Tangsipingtou (TSPT) and P (Fig. 4).

Fig. 3 Plot of the Ln probability of data, Ln P(D), averaged over the replicates

Fig. 4 Population structure of 95 inbreds based on 145 SSRs

The Lancaster subgroup was the largest subgroup and included 30 inbred lines. These lines were closely related to the Mo17 and Zi330 pedigrees. In addition, some lines derived from the landrace “Ludahonggu” were also assigned to this subgroup. The next largest subgroup, TSPT, had 13 inbred lines, which were mainly derived from Huangzaosi, one of the founder parents in maize breeding in China. The Reid subgroup included 12 inbred lines. The P subgroup was the smallest, with only 7 inbred lines, of which Shen137 was derived from the Pioneer hybrid “6JK111” and the other 6 lines were all derived from “78599”. The remaining 33 inbred lines, which had <0.8 membership in each of the four subgroups and showed a mixture of two or more subgroups, were assigned to a mixed subgroup (Table 4).

Table 4 Membership of inbred lines corresponding to each subgroup

Genetic diversity of subgroups

The genetic diversity of each subgroup was assessed (Table 5). The Lancaster subgroup was the most diverse, with a total of 924 alleles, 6.37 alleles per locus, and a gene diversity of 0.65. The next was the TSPT subgroup, which had a total of 649 alleles, 4.48 alleles per locus, and a gene diversity of 0.58. The Reid subgroup was less diverse than the TSPT subgroup. In addition, 35.09% of all alleles were subgroup-specific. The Lancaster subgroup had the most subgroup-specific alleles (256, or 27.71% of its alleles), and 8.53% of its alleles were line-specific. This also indicated that the Lancaster subgroup contained the highest genetic variation.

Table 5 Summary statistics for each subgroup

To account for the effect of subgroup size, a re-sampling strategy was also used to evaluate subgroup diversity. Since the P subgroup had only 7 inbreds, we randomly selected 7 inbreds from each of the other three subgroups and calculated the total number of alleles and the gene diversity. The results from 1,000 repetitions were averaged to assess the genetic diversity of each subgroup. The ranking of the subgroups by genetic diversity remained the same when the same number of inbreds was sampled from each subgroup (Figs. 5, 6).

Fig. 5 Comparison of number of alleles for all samples and 7 random samples in each subgroup

Fig. 6 Comparison of gene diversity for all samples and 7 random samples in each subgroup

Linkage disequilibrium

Linkage disequilibrium (LD) among SSRs was investigated in the entire set of inbred lines and in each of the subgroups. In the 95 inbred lines, LD was significant at the 0.01 level for 63.89% of the SSR pairs, but the proportion within each subgroup was lower (Table 6). Furthermore, the percentage of SSR pairs in LD was much higher in the Lancaster subgroup than in the other subgroups, and was lowest in the P subgroup. Because the statistical power of the LD test depends on sample size, a re-sampling strategy was adopted to calculate the expected proportion of pair-wise SSR loci in significant LD. When random samples of the same size as each subgroup were drawn from the entire set of inbred lines, the observed and expected proportions of significant pair-wise LD were almost equal. This indicated that sample size contributed substantially to the higher percentage of pair-wise SSR loci in LD in the entire sample than in each subgroup, whereas population structure and relatedness did not markedly affect LD in the subgroups.

Table 6 Percentage of SSR pairs in LD at a < 0.01 level

To investigate the relationship between linkage and LD, we estimated the percentage of linked SSR loci pairs in significant LD in the entire set of inbred lines and in each subgroup. In the 95 inbred lines, an average of 83.33% of linked SSR loci pairs were in significant LD at the 0.01 level. In each model-based subgroup, most of the linked SSR loci pairs were in significant LD, although the percentage varied among chromosomes (Table 7). Overall, linkage was the main factor underlying significant LD between SSR loci pairs in the entire sample and in each subgroup.

Table 7 Percentage of linked SSR loci pairs in significant LD in overall set and different subgroups

Discussion

Genetic diversity of the mini core inbred lines

Choice of germplasm is one of the key factors determining the resolution of association mapping. To detect more alleles, the germplasm selected should in theory include all of the genetic variation of a species, because diverse germplasm has undergone more extensive historical recombination and therefore allows a high level of resolution. For species for which a core collection has been established, the core would be the ideal material for association mapping (Whitt and Buckler 2003). We have constructed a core collection for the maize germplasm preserved in the Chinese National Genebank, which included 951 landraces and 242 inbred lines (Li et al. 2004). Later, these 242 inbred lines and 46 elite lines used in recent years in Chinese breeding programs were genotyped for genetic diversity at 49 SSR loci, and a mini core set of 94 inbred lines representing 87% of the SSR allelic diversity of the 288 inbred lines was defined (Yu et al. 2007). In defining the set, some lines of great agronomic importance were included, and the mini core panel was selected to capture the maximum number of alleles of the 288 inbred lines. These 94 inbred lines together with B73 constituted the association mapping population for further analysis.

In the present study, 145 SSR loci, randomly distributed across the genome, were used to assess the genetic diversity of the population. A total of 1,365 alleles, with an average of 9.4 alleles per locus, were detected in the entire population, and the average gene diversity and PIC were 0.6831 and 0.6439, respectively. The genetic diversity was higher than that reported by Xie et al. (2007, 2008) (PIC was 0.615), close to that of Stich et al. (2005) (genetic diversity was 0.68) and Matsuoka et al. (2002) (gene diversity was 0.62), but much lower than that of Liu et al. (2003) (gene diversity was 0.82). These differences were mainly due to the germplasm studied and the SSRs used. The higher genetic diversity detected by Liu et al. (2003) was mainly due to the broader range of germplasm and the larger share of dinucleotide SSRs. The mutation rate of dinucleotide SSRs is much higher than that of other types (Vigouroux et al. 2002), which is consistent with the higher diversity of dinucleotide SSRs observed in the present study.

Population structure of the mini core inbred lines

Chinese maize inbred lines often have complex genetic backgrounds; therefore, understanding the population structure and relationships among inbred lines is important for maize improvement and association analysis. In the present study, 145 SSRs that covered the entire maize genome were selected to analyze the population structure of the 95 inbreds. We used ≥0.8 membership as the criterion for subgroup assignment, and the analysis showed that when K = 4, the model-based subgroups were consistent with the known pedigrees of the inbred lines and with the four major empirical germplasm origins, i.e., Lancaster, Reid, TSPT and P. Of all the inbred lines, 65.3% were assigned to one of these subgroups. The Lancaster, TSPT, Reid and P subgroups accounted for 31.6, 13.7, 12.6 and 7.4% of the entire population, respectively.
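As a check on these proportions, using only the subgroup sizes reported in the Results (30, 13, 12 and 7 lines out of 95):

$$ \frac{30 + 13 + 12 + 7}{95} = \frac{62}{95} \approx 65.3\%, \qquad \frac{30}{95} \approx 31.6\%, \quad \frac{13}{95} \approx 13.7\%, \quad \frac{12}{95} \approx 12.6\%, \quad \frac{7}{95} \approx 7.4\% $$

The remaining 33 lines (33/95 ≈ 34.7%) form the mixed subgroup.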

The results of this study showed that the derivatives of Zi330, Mo17 and Ludahonggu had high genetic similarity, and they were therefore classified into the Lancaster subgroup. The Reid germplasm was introduced from the USA between the 1950s and the 1970s, and Chinese breeders developed many inbred lines from it. For example, Ye8112 was selected from the maize hybrid “8112”, 5003 was selected from the maize hybrid “3147”, and a series of inbred lines such as Ye478, 488, Liao7794 and Liao5110 were derived from 5003. These lines were assigned to the Reid subgroup in our analysis. Since the late 1980s, some Pioneer hybrids have been introduced into China, giving rise to a new group defined as “P” (Wang et al. 2004). “78599”, one of the most important of these hybrids, has been widely used for selecting inbred lines. To date, more than 100 hybrids have been released using “78599”-derived inbreds. In our population, a few lines were selected from “78599”.

The genetic structure of Chinese maize inbred lines has been documented in several previous studies. The consensus is that Chinese maize inbred lines can be classified into 4–6 subgroups, most of which correspond to the heterotic groups established from pedigree information and combining ability (Peng et al. 1998; Wang et al. 1998, 1999; Yuan et al. 2001; Xie et al. 2007, 2008). Recently, Xie et al. (2007, 2008) analyzed 187 commonly used Chinese maize inbred lines, representing the genetic diversity among public, commercial and historically important lines for maize breeding, and detected six subpopulations: BSSS, PA, PB, Lancaster, Ludahonggu (LRC) and TSPT. When only three clusters were allowed, however, the clusters were associated with geographic origins, i.e., A (PA, BSSS, Lancaster), B (PB) and D (LRC, TSPT). Interestingly, Ludahonggu is a landrace originally grown in Luda, Liaoning Province of Northeast China, and was probably introduced from the USA in the 1920s (Li et al. 2002). Previously, the derivatives of Ludahonggu were regarded as an independent group called the “Ludahonggu group” (Li et al. 2002; Peng et al. 1998; Xie et al. 2007, 2008). Other reports, on the other hand, suggested that Dan340, a typical Ludahonggu-derived inbred line, could be classified into the Zi330 group (Sun et al. 1999; Yuan et al. 2001; Li et al. 2003; Teng et al. 2004; Zheng et al. 2006). Our results support this classification. In addition, the merging of PA and BSSS identified by Xie et al. (2008) is also accepted by breeders and researchers in China, since both contain germplasm of Reid origin. Although some discrepancies exist among studies, probably resulting from differences in the materials and the SSRs (type and number) used, the general profile of the genetic structure of Chinese maize inbred lines is largely consistent.

In addition, the remaining 33 inbred lines, which had a membership of <0.8 in each of the four subgroups, were classified into a mixed subgroup and accounted for 34.7% of the total inbred lines. Among these 33 inbreds, 9 lines were selected from foreign hybrids, 3 lines from Chinese hybrids, and 2 lines from Chinese landraces. This also indicated that the mini core set of inbred lines came from wide origins and contained extensive genetic variation. Population structure analysis can also help clarify the genetic composition of lines, especially those with unknown pedigree information, such as 87-20, Yi67, DaMo, CML67 and H205. Unexpectedly, C103, an important inbred line from the USA, had 52.9% similarity with the TSPT germplasm and 46.5% similarity with the Lancaster germplasm. This needs to be investigated further, although it does not suggest that C103 originated from TSPT, a Chinese landrace.

Linkage disequilibrium and the forces causing LD

In the present study, 63.89% of the SSR pairs exhibited significant LD; in each model-based subgroup, however, the percentage of SSR pairs in LD was much lower, ranging from 18.72 to 40.28%. The overall percentage was considerably higher than that reported by Remington et al. (2001), possibly because of the higher density of SSRs used in our study. However, it was lower than that of Stich et al. (2005) and comparable to the results reported by Liu et al. (2003), which agrees with previous findings that the level of LD detected by SNPs or SSRs is higher in narrow germplasm than in diverse germplasm (Ching et al. 2002; Liu et al. 2003).

LD observed in a population is the result of the interplay of many factors, including linkage, population structure, relatedness, selection, mutation and genetic drift (Huttley et al. 1999; Flint-Garcia et al. 2003; Rafalski and Morgante 2004; Gupta et al. 2005). The forces generating and conserving LD in a population have received increasing attention in recent years and have been examined with experimental data (Stich et al. 2005, 2006) and computer simulations (Stich et al. 2007). LD generated by linkage is highly useful for genome-wide association mapping, whereas LD generated by population structure and genetic drift can result in spurious marker-trait associations. For LD generated by selection, mutation and relatedness, the influence depends on the population under consideration. Additionally, since Vigouroux et al. (2002) suggested that the mutation rate of different types of SSRs in maize was very low, the influence of mutation on LD between SSR loci could be neglected. In our analysis, random samples equal in size to each subgroup were drawn from the entire set of lines. The results showed that the expected percentage of SSR loci pairs in significant LD was almost the same as that observed in each subgroup. This indicated that population structure, relatedness and genetic drift did not strongly influence LD between SSR loci in each subgroup. Given the high percentage of linked SSRs in significant LD in the entire sample and in each subgroup, linkage was assumed to be the major force generating LD in both.

In the present study, a mini core set of 95 maize inbred lines was assembled for association mapping. Diversity analysis with 145 SSR loci covering the entire maize genome showed that the population was representative of Chinese maize inbred lines and included diverse genetic variation. Population structure analysis showed that four subgroups existed in the population. Although many factors contributed to the LD between SSR loci, linkage was the major force generating and conserving it. The results suggest that the population may be used to detect genome-wide SSR marker–phenotype associations.