Development of an allele-mining set in rice using a heuristic algorithm and SSR genotype data with least redundancy for the post-genomic era

Zhao, Weiguo; Cho, Gyu-Taek; Ma, Kyung-Ho; Chung, Jong-Wook; Gwag, Jae-Gyun; Park, Yong-Jin

doi:10.1007/s11032-010-9400-x

Published: 16 February 2010

Volume 26, pages 639–651, (2010)

Molecular Breeding

Weiguo Zhao^1,2,3,
Gyu-Taek Cho⁴,
Kyung-Ho Ma⁴,
Jong-Wook Chung^1,2,
Jae-Gyun Gwag⁴ &
…
Yong-Jin Park^1,2

Abstract

The allelic diversity of a collection of 4046 rice accessions was assessed using 15 neutral SSR markers distributed throughout the genome. A total of 482 alleles were detected; the average allelic richness was 32.1 alleles per locus. Using a heuristic approach, an allele-mining set was developed on the basis of the SSR marker data. The 162 accessions of the allele-mining set, accounting for about 4.0% of the entire collection, captured all 482 alleles found in the entire collection, i.e. 100% allele coverage with minimum redundancy. When the heuristic approach was validated using another 14 SSR markers associated with starch synthesis, the allele-mining set captured 70% of the total alleles and 83% of the restricted alleles (allele frequency > 0.05%). The allele-mining set thus remained effective even when evaluated with a different, trait-specific set of markers in the same entire collection. The methodology for developing allele-mining sets can be used in other crop species. By retaining all alleles of the entire collection, this allele-mining set will be useful for future studies aimed at introducing unused useful alleles into elite rice varieties in the post-genomic era.

Introduction

Rice (Oryza sativa L.), as a model cereal species, is one of the most important crops in the world and provides the main energy resource for more than half the world’s population (Yu et al. 2002). The survival of mankind in the future will depend to a great extent on the quantity and diversity of germplasm collections. Therefore, many countries and organizations have established hundreds of genebanks and have conserved millions of crop germplasm resources. For instance, the International Rice Genebank at the International Rice Research Institute (IRRI) maintains a collection of more than 108,925 rice accessions (http://www.cgiar.org/pdf/newsroom_svalbard_irri_shipment.pdf); there are also many other large rice collections in countries such as China and India. With the rapid increase in the number of accessions contained in crop germplasm collections, many genebanks face the problems of redundant resources and the cost of maintaining these collections, which may be an obstacle for their full exploitation, evaluation and utilization (Holden 1984). For the convenience of management, research and application, Frankel and Brown (1984) proposed the concept of the core set. The design of the core set should include the maximum possible genetic diversity contained in the entire collection with a minimum of repetitiveness. The information obtained from such a core set can aid in the judicious use of the entire collection. To date, most core sets have been developed on the basis of passport data giving the geographical origin, morphological and phenotypic traits, and biochemical or molecular markers in many crops (Perry et al. 1991; Joe and Orlando 1996; Hokanson et al. 1998; Ortiz et al. 1998; Huaman et al. 1999; Parsons et al. 1999; Chavarriaga-Aguirre et al. 1999; Marita et al. 2000; Upadhyaya and Ortiz 2001; Chandra et al. 2002). 
However, most traits of crop varieties are quantitative and controlled by multiple genes that are easily influenced by environmental conditions, whereas molecular markers reflect changes that have occurred at the DNA level but are not necessarily expressed in the phenotype of the organism (Tanksley and McCouch 1997; Li et al. 2004).

A good core set should minimize redundant entries and should be sufficiently large to provide reliable conclusions for the entire collection (Brown 1989). When establishing a core set, the sampling proportion and the representation of variation in the entire collection are both important for retaining the greatest degree of genetic diversity. Many different sampling methodologies are available. These include simple random sampling and stratified random sampling (Peeters and Martinelli 1989; Crossa et al. 1995; Charmet and Balfourier 1995; Rincon et al. 1996; Chandra et al. 2002; Franco et al. 2003), as well as more sophisticated methods. For stratified random sampling, Brown (1989) proposed three allocation methods: constant (C strategy), proportional (P strategy), and logarithmic proportional (L strategy). Franco et al. (2005) proposed using Gower’s distance between accessions within each cluster (D method) as the allocation criterion. Li et al. (2004) developed a core collection using the adjusted unbiased prediction (AUP) method based on the predicted genotypic value of rice. Clustering algorithms have also been used as an important tool to reduce redundancy and select core sets within groups in germplasm research (van Hintum et al. 1995; Zewdie et al. 2004; Upadhyaya et al. 2006; Mosjidis and Klingler 2006). Hu et al. (2000) developed a stepwise clustering method for sampling, and least distance stepwise sampling (LDSS) has been shown to be a valid method for eliminating the influence of different clustering methods (Wang et al. 2007). A method for determining sample sizes based on genetic distances was introduced by Franco et al. (2005). Jansen and van Hintum (2007) further developed a novel sampling method for obtaining a core set using genetic distances.

Recently, molecular genetic markers have been widely used to characterize genebank collections (Bretting and Widrlechner 1995; van Hintum and van Treuren 2002). Schoen and Brown (1993) addressed the issue of how to use genetic markers to sample collections of wild crops while maximizing allelic richness. The H strategy seeks to maximize the total number of alleles in the core collection by sampling accessions from groups in proportion to their within-group genetic diversity, while the M (maximization) strategy maximizes the number of observed alleles at each marker locus. Bataillon et al. (1996) found by computer simulation that the M strategy was more effective for retaining widespread and low-frequency neutral alleles than the other sampling strategies. Gouesnard et al. (2001) developed the MSTRAT algorithm by implementing the M strategy for selecting accessions. These different approaches have been compared by Franco et al. (2006). McKhann et al. (2004), Ronfort et al. (2006) and Cunff et al. (2008) developed a nested genetic core collection using the M strategy. Kim et al. (2007) developed PowerCore software: a program applying the advanced M strategy with a heuristic search for establishing a core set.

Many genebanks all over the world contain untapped resources of distinct alleles which will remain hidden unless efforts are initiated to screen these alleles for their potential use and function; the process is known as “allele mining”, which will contribute to discovering and exploiting the hidden diversity for many complex traits (Varshney et al. 2005). The deployment of an allele-mining set, a kind of mini core set for finding new alleles from selected entries using genomic tools, has been an area of much interest for researchers, especially those working in the field of allele mining. A representative set of rice for allele mining, as the core set described, should best represent the diversity present in the entire genebank. With the rice genome sequence available (Collard et al. 2008), allele mining provides the avenue for the validation of specific gene(s) responsible for a particular trait and mining of the most favorable alleles from the rice genebank. Thus, in the post-genomics era, allele mining in a large collection of accessions will contribute to genomics research for crop improvement (Varshney et al. 2005). These developments will be a boon for plant breeders who are trying to increase yields and create new varieties which are resistant to diseases, pests, drought and salinity and/or with improved nutritional quality (Latha et al. 2004).

The objective of this study was to develop an allele-mining set, a kind of mini core set, using a heuristic approach and SSR genotype data from an entire collection of 4046 rice accessions conserved in the National Genebank of Rural Development Administration, Republic of Korea (RDA-genebank), and to evaluate the allele-mining set by applying another set of SSR markers related to starch synthesis. The usability of the allele-mining set was also tested.

Materials and methods

Plant materials

The RDA-genebank holds 25,604 accessions of rice (Oryza sativa L.) from 60 countries (http://genebank.rda.go.kr/). From this collection, 4,046 accessions (approximately 15.8% of the total collection), including the introduced varieties, breeding lines and varieties, weedy accessions, and the Korean landraces, were selected based on the passport data in this study (Table 1). The IRRI set, a super mini-set for the DNA polymorphism test for developing a DNA bank at the IRRI genebank, was included to separate one from the others. Of these, 1,065 accessions originated from 71 countries and 2,981 accessions were from the Republic of Korea. A description of the entire rice collection used in this study is shown in Table 1.

Table 1 Accessions used in this study on developing an allele mining set

SSR genotyping

A total of 29 SSR markers, comprising 15 neutral SSRs and 14 SSRs associated with starch synthesis in rice, were analyzed in this study. All of these SSR markers were obtained from GRAMENE (http://www.gramene.org/). Markers were chosen according to their location on the rice genetic map, so as to give good coverage of the whole genome, and their suitability for high-throughput genotyping. A three-primer system (Schuelke 2000) was used for SSR PCR amplification. It consisted of a universal M13 oligonucleotide (TGTAAAACGACGGCCAGT) labeled with one of the fluorescent dyes 6-FAM, NED, and HEX, allowing PCR products to be triplexed during electrophoresis; a special forward primer composed of the M13 oligonucleotide concatenated with the specific forward primer; and the specific reverse primer. DNA amplifications were performed using an MJ Research PTC-100 96 Plus thermal cycler. PCR reactions were carried out in a volume of 15 μL containing 10 ng of total DNA, 10× PCR buffer, 0.25 mM of each dNTP, 8 pmol of each primer, and 1 U of Taq polymerase using a touchdown procedure. Information on primer sequences and PCR amplification conditions for each set of primers is available at http://www.gramene.org/. SSR alleles were resolved on the ABI PRISM 3100 DNA sequencer (Applied Biosystems, Foster City, CA, USA) using GENESCAN 3.7 software and sized precisely using GeneScan 500 ROX (6-carbon-X-rhodamine) molecular size standards (35–500 bp) with GENOTYPER 3.7 software (Applied Biosystems).

Development of an allele-mining set

The advanced M strategy, based on a modified heuristic algorithm implemented in the PowerCore software by Kim et al. (2007), was used to develop the allele-mining set. The PowerCore software maximizes the number of alleles with the least redundancy in the SSR data set (Kim et al. 2007). In the PowerCore software, the A* algorithm, a heuristic algorithm that finds the optimum path from the initial to the final stage, was used:

$$ f(n) = g(n) + h(n) $$

Here, g(n) is the cost of the path from the initial state to the current node, taken as the number of accessions inserted into the frequency table, and h(n) is an estimate of the cost of the cheapest path from that node to the goal node, taken as the maximum number of empty cells within each column. The algorithm expands the paths that have the lowest value of g(n) + h(n): at each expansion step the sum of g(n) and h(n) is evaluated and the accession with the lowest value is chosen. If h(n) is admissible, i.e. it never overestimates the cost of reaching the goal, then A* will always find an optimal solution (http://genebank.rda.go.kr/PowerCore/) (Kim et al. 2007).
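The evaluation function above can be illustrated with a minimal Python sketch. This is not the PowerCore implementation; it is a toy A* search under stated assumptions: each accession carries exactly one allele per locus (hypothetical inbred genotypes), g(n) is the number of accessions selected so far, and h(n) is the largest number of still-uncovered alleles at any single locus. Under the one-allele-per-locus assumption this h(n) never overestimates the number of accessions still required. The function names and the toy genotype table are illustrative only.

```python
import heapq
from itertools import count


def gaps_per_locus(selected, genotypes, all_alleles):
    """Number of alleles at each locus not yet carried by any selected accession."""
    gaps = []
    for locus, alleles in enumerate(all_alleles):
        covered = {genotypes[name][locus] for name in selected}
        gaps.append(len(alleles - covered))
    return gaps


def a_star_allele_cover(genotypes):
    """Return a smallest set of accessions that together carry every allele.

    genotypes maps accession name -> tuple with one allele call per locus.
    """
    names = sorted(genotypes)
    n_loci = len(genotypes[names[0]])
    all_alleles = [{genotypes[n][l] for n in names} for l in range(n_loci)]

    tie = count()  # tie-breaker so the heap never compares frozensets
    start = frozenset()
    h_start = max(gaps_per_locus(start, genotypes, all_alleles))
    frontier = [(h_start, next(tie), start)]
    expanded = set()

    while frontier:
        _, _, selected = heapq.heappop(frontier)
        if selected in expanded:
            continue
        expanded.add(selected)
        gaps = gaps_per_locus(selected, genotypes, all_alleles)
        if max(gaps) == 0:  # every cell of the frequency table is filled
            return sorted(selected)
        for name in names:
            if name in selected:
                continue
            child = selected | {name}
            child_gaps = gaps_per_locus(child, genotypes, all_alleles)
            if child_gaps == gaps:  # accession adds no new allele: redundant
                continue
            f = len(child) + max(child_gaps)  # f(n) = g(n) + h(n)
            heapq.heappush(frontier, (f, next(tie), child))
    return []


# Hypothetical toy data: five accessions scored at two SSR loci.
toy_genotypes = {
    "A1": (100, 200),
    "A2": (102, 200),
    "A3": (100, 204),
    "A4": (102, 204),
    "A5": (104, 208),
}
mining_set = a_star_allele_cover(toy_genotypes)
```

The search is exhaustive in the worst case, so this sketch is only practical for small examples; it is meant to show how g(n) and h(n) interact, not to scale to thousands of accessions.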

The efficiency of the sampling strategy was assessed by comparing the total number of alleles captured using a modified heuristic algorithm in samples of increasing size to the number of alleles captured in random sampling and stratified random sampling according to geographic region and variety type from the same entire collection (Brown 1995; Qiu et al. 2003; Yan et al. 2007). Fifty independent samplings were made in each case (q = 50).
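The random-sampling baseline used in this comparison can be sketched as follows. This is an illustrative outline only, assuming one allele call per locus per accession and using a hypothetical toy genotype table; the default of 50 replicates mirrors q = 50 above, while the function names and the seed are assumptions. Stratified sampling is not shown.

```python
import random


def count_alleles(sample, genotypes):
    """Total number of distinct alleles, summed over loci, carried by a sample."""
    n_loci = len(next(iter(genotypes.values())))
    return sum(
        len({genotypes[name][locus] for name in sample})
        for locus in range(n_loci)
    )


def mean_alleles_random(genotypes, size, replicates=50, seed=0):
    """Mean number of alleles captured by simple random samples of a given size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = sorted(genotypes)
    totals = [
        count_alleles(rng.sample(names, size), genotypes)
        for _ in range(replicates)
    ]
    return sum(totals) / replicates


# Hypothetical toy data: five accessions scored at two SSR loci.
toy_genotypes = {
    "A1": (100, 200),
    "A2": (102, 200),
    "A3": (100, 204),
    "A4": (102, 204),
    "A5": (104, 208),
}
random_curve = {
    size: mean_alleles_random(toy_genotypes, size)
    for size in range(1, len(toy_genotypes) + 1)
}
```

Plotting such a curve against the number of alleles captured by the heuristic selection at the same sample sizes gives the kind of efficiency comparison described in the text.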

Validating the core collection

Using the allele-mining set may improve the efficiency of germplasm evaluation by reducing the number of accessions to be evaluated while increasing the probability of finding genes of interest. To assess the effectiveness of the allele-mining set in this study, another set of SSR markers was tested on both the entire collection and the allele-mining set. This set of 14 SSR markers was selected according to their association with starch synthesis. The coverage of alleles was compared between the entire collection and the allele-mining set constructed using the 15 neutral SSR markers.

Data analysis

The total number of alleles per locus, the number of rare alleles per locus (i.e. alleles with a frequency lower than 5%), the number of unique alleles per locus (alleles occurring in only one accession), the Shannon–Weaver diversity index (I) (Shannon and Weaver 1949), Nei’s gene diversity index (H) (Nei 1973), and the polymorphism information content (PIC) per locus were calculated for the entire collection and the allele-mining set using PowerCore (Kim et al. 2007) and PowerMarker 3.25 software (Liu and Muse 2005) based on Rogers’ distance (Rogers 1972). All the indices were calculated independently for the entire collection and the allele-mining set to determine whether the diversity at each locus was retained in the allele-mining set. Frequency distributions for each locus were determined using Microsoft Excel 2007 software. Statistical analysis was conducted using the univariate and correlation procedures of SPSS 14.0 (http://www.spss.com/) and Statistica 7.0 (http://www.statsoft.com/) statistical software.

Nei’s gene diversity (H) was calculated based on the formula

$$ H = 1 - \sum\limits_{i = 1}^{n} {\left( {\frac{n_{i}}{N}} \right)}^{2} $$

where n_i is the number of times the ith allele is observed at the locus (so that n_i/N is its frequency), n is the number of alleles at this locus and N is the total number of accessions.

The Shannon–Weaver diversity index (I) as presented was estimated using

$$ I = - \sum\limits_{i = 1}^{n} {p_{i} \log_{e} p_{i} } $$

where p_i is the frequency of the ith class (here, the ith allele at the locus) and n is the number of classes.

The PIC for each marker was calculated based on the formula

$$ {\text{PIC}} = 1 - \sum\limits_{i = 1}^{n} {p_{i}^{2} } - \sum\limits_{i = 1}^{n - 1} {\sum\limits_{j = i + 1}^{n} {2p_{i}^{2} p_{j}^{2} } } $$

where p_i and p_j are the relative frequencies of the ith and jth alleles (patterns) of the SSR marker and n is the number of alleles (Botstein et al. 1980).
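The three indices can be computed directly from the allele frequencies at one locus. The following Python sketch is an illustration of the formulas above, not the PowerCore or PowerMarker code; the function names and the example allele calls are hypothetical.

```python
import math
from collections import Counter


def allele_frequencies(calls):
    """Relative frequency of each allele among the allele calls at one locus."""
    counts = Counter(calls)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def nei_gene_diversity(freqs):
    """H = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def shannon_weaver(freqs):
    """I = -sum(p_i * ln p_i)."""
    return -sum(p * math.log(p) for p in freqs if p > 0)


def pic(freqs):
    """PIC = 1 - sum(p_i^2) - sum over pairs i < j of 2 * p_i^2 * p_j^2."""
    n = len(freqs)
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    return 1.0 - sum(p * p for p in freqs) - cross


# Hypothetical allele sizes (bp) scored at one SSR locus in four accessions.
example_freqs = allele_frequencies([100, 100, 102, 104])
example_H = nei_gene_diversity(example_freqs)
example_I = shannon_weaver(example_freqs)
example_PIC = pic(example_freqs)
```

For a locus with two equally frequent alleles these functions give H = 0.5 and PIC = 0.375, which is a convenient sanity check.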

Results

Allele mining of 4046 rice accessions

The allelic diversity of a collection of 4046 rice accessions was assessed using 15 neutral SSRs distributed throughout the genome, and the resulting statistics are summarized in Table 2. A total of 482 alleles were detected; the number of alleles per locus ranged from 15 (RM246) to 61 (RM206), with an average allelic richness of 32.1 alleles per locus. The rare alleles (398 in total) represented about 82.6% of the total number of alleles, showing that most alleles occur at low frequency (Supplementary Fig. 1). The mean Shannon–Weaver diversity index and Nei’s gene diversity were 2.716 and 0.932, respectively. The PIC ranged from 0.7338 to 0.9333 with an average of 0.8448 (Table 2).
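
As a quick arithmetic check, the two summary figures follow directly from the counts reported in this paragraph (482 alleles, 15 loci, 398 rare alleles):

$$ \frac{482}{15} \approx 32.1\ {\text{alleles per locus}},\qquad \frac{398}{482} \approx 82.6\% \ {\text{rare alleles}} $$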

Table 2 Total number of alleles, number of rare alleles and genetic diversity index for 15 neutral SSR loci in the entire accessions and allele-mining set

Development of an allele-mining set

The 482 alleles detected at the 15 neutral SSR loci were used to develop the allele-mining set using the PowerCore software (http://genebank.rda.go.kr/PowerCore). The basis of developing an allele-mining set using PowerCore is the nominalization of variables, which leads to a decrease in the number of accessions in the allele-mining set and was considered necessary for performing the heuristic search through its evaluation function using the given data (Kim et al. 2007). Figure 1 shows that, in this case, the sampling efficiency (i.e. the ability to capture allelic diversity) of the modified heuristic algorithm was always better than that of the other strategies. Furthermore, the relative efficiency of this advanced M (maximization) strategy was highest for smaller allele-mining set samples; for instance, the heuristic approach outperformed stratified random sampling by about 50% when sample sizes were in the range of 50–150 (i.e. an allele-mining set size of 1.2–3.7% of the entire sample size).

It was found that the allele-mining set (162 accessions) (Supplementary Table 1), accounting for about 4.0% of the entire collection, captured all of the alleles (482) of the markers presented in the entire collection, which showed 100% coverage of alleles with minimum redundancy (Table 2). Compared with other conventional sampling methods, the heuristic approach showed the highest capturing efficiency (Table 5).

Genetic diversity of the allele-mining set

To fully realize the advantages of an allele-mining set, the allele-mining set should include most of the genetic diversity in the entire collection and be closely correlated with the entire collection (Yan et al. 2007). The allele-mining set of this study represented all SSR alleles of the entire rice collection. As shown in Table 2, the correlation coefficients (r) of the mean diversity indices between the allele-mining set and the entire collection were highly significant (Table 3; Supplementary Fig. 2). In this allele-mining set, all the alleles were covered and highly significant correlations were recorded for all parameters studied, which indicated that the allele-mining set effectively represented the genetic diversity of the entire collection. We also compared the allele frequencies of the SSR markers in the allele-mining set with the frequencies observed in the entire collection. The two sets of allele frequencies were highly significantly correlated (r = 0.87; P < 0.01) (Fig. 2), indicating that not only were the same alleles represented but their frequencies were also similar.
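The allele-frequency comparison can be sketched as a Pearson correlation between per-allele frequencies in the entire collection and in the subset. The Python outline below is illustrative only: it assumes one allele call per locus per accession, uses a hypothetical toy genotype table and subset, and the function names are not taken from any of the software cited above.

```python
import math
from collections import Counter


def locus_allele_freqs(genotypes, names):
    """Frequency of each (locus, allele) pair among the given accessions."""
    counts = Counter()
    for name in names:
        for locus, allele in enumerate(genotypes[name]):
            counts[(locus, allele)] += 1
    return {key: c / len(names) for key, c in counts.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: five accessions scored at two SSR loci.
toy_genotypes = {
    "A1": (100, 200),
    "A2": (102, 200),
    "A3": (100, 204),
    "A4": (102, 204),
    "A5": (104, 208),
}
whole = locus_allele_freqs(toy_genotypes, sorted(toy_genotypes))
subset = locus_allele_freqs(toy_genotypes, ["A1", "A2", "A4", "A5"])

# Pair the frequencies allele by allele; alleles absent from the subset count as 0.
keys = sorted(whole)
r = pearson_r([whole[k] for k in keys], [subset.get(k, 0.0) for k in keys])
```

Note that `pearson_r` is undefined when either list has zero variance (for example, a subset in which every allele has the same frequency), so a real analysis would need to guard against that case.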

Table 3 t-test results between the entire collection and the allele-mining set

Validation of the allele-mining set

The construction of a so-called “allele-mining set” from a large germplasm collection is a situation where allelic richness is a relevant measure of diversity (Schoen and Brown 1993; Bataillon et al. 1996), because as many alleles as possible should be retained in the allele-mining set, where they would be available for phenotypic screening and breeding programs. To validate the heuristic approach, the same accessions were genotyped with a set of 14 additional SSR markers related to starch synthesis, and the entire collection and the allele-mining set were compared. These markers are different from the markers used to build the allele-mining set. Statistics describing the allelic diversity of these 4046 accessions for the 14 additional SSR markers are summarized in Table 4. In total, 214 alleles were detected with the 14 SSR markers in the entire collection. The number of alleles per locus ranged from 4 to 34 with an average of 15.3. Compared with the neutral SSRs, the SSRs related to quality were less polymorphic (Fig. 3) and the frequency distributions of alleles per locus differed between the two marker sets (Fig. 4). For these 14 markers, PIC ranged from 0.3423 to 0.8923, with an average of 0.6249. Table 5 summarizes the total number of alleles detected for the two types of markers. For the second set of 14 markers, 70% of the alleles observed in the 4046 accessions were captured in the allele-mining set. For association studies, Malysheva-Otto et al. (2006) considered that rare alleles occurring in more than 0.5% of investigated accessions should be referred to as widespread or often occurring alleles, since markers with low allele frequencies would need to have a very strong effect to be detected. 
There are 29 unique alleles in the second set of SSRs, of which only 3 were kept in the allele-mining set, but the allele-mining set represented 83% of the 176 restricted alleles (frequencies > 0.05%, corresponding to two accessions out of the entire collection) found in the entire collection (Cunff et al. 2008). Even though not all alleles of useful genes were captured, the heuristic approach still performed better than the other sampling strategies (Table 5). Given the nature of the allele-mining set, it is impossible to guarantee the complete capture of all alleles for each gene (McKhann et al. 2004). However, the allele-mining set developed here using the heuristic approach eliminated the redundancy in the rice collection and succeeded in capturing most of the alleles in some genes of interest.
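The coverage figures in this validation step amount to counting which alleles of the entire collection also occur in the subset, optionally restricted to alleles above a frequency threshold. The Python sketch below illustrates that calculation under stated assumptions: one allele call per locus per accession, a hypothetical toy table for a single validation marker, and a hypothetical `min_count` parameter standing in for the frequency threshold (how the 0.05% cut-off maps to an accession count is not assumed here).

```python
from collections import Counter


def allele_counts(genotypes, names):
    """Number of accessions carrying each (locus, allele) pair."""
    counts = Counter()
    for name in names:
        for locus, allele in enumerate(genotypes[name]):
            counts[(locus, allele)] += 1
    return counts


def coverage(genotypes, subset, min_count=1):
    """Fraction of alleles seen in at least min_count accessions of the whole
    collection that are also present in the subset."""
    whole = allele_counts(genotypes, genotypes)
    kept = allele_counts(genotypes, subset)
    target = [key for key, c in whole.items() if c >= min_count]
    captured = sum(1 for key in target if key in kept)
    return captured / len(target)


# Hypothetical calls at one validation marker for five accessions.
validation_genotypes = {
    "A1": (1,),
    "A2": (1,),
    "A3": (2,),
    "A4": (2,),
    "A5": (3,),
}
total_coverage = coverage(validation_genotypes, ["A1", "A3"])
restricted_coverage = coverage(validation_genotypes, ["A1", "A3"], min_count=2)
```

In this toy case the subset misses the allele carried only by A5, so total coverage is lower than coverage of the alleles present in at least two accessions, which mirrors the gap between the 70% and 83% figures reported above.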

Table 4 The number of alleles, number of rare alleles and genetic diversity index for 14 SSR loci related to starch in the entire accessions and allele-mining set

Table 5 Capturing total number and proportion of alleles in the same entire accessions and allele-mining set by two types of markers

Discussion

Studies on allelic diversity have proved fruitful in understanding the genetic basis of complex traits (Szalma et al. 2005). The sequencing of the complete genome of rice makes it possible to access all the genes of this species and increases the chances of exploiting the natural genetic diversity through association genetics (Varshney et al. 2005; Collard et al. 2008). However, our basic knowledge of the extent of allelic variation within the species is still not sufficient. Considering the huge numbers of accessions that are held collectively by genebanks, germplasm collections are thought to harbor a wealth of undisclosed allelic variants. Mining alleles will improve the efficiency of conservation and use of genetic resources (Kresovich et al. 2002; Varshney et al. 2005). The allelic richness of 32.1 observed in our study was much higher than that previously reported by Garris et al. (2005) (mean 11.8) using 169 SSRs and 234 rice accessions, and by Ebana et al. (2008) (mean 7.7) using 23 SSRs and 236 Japanese rice landrace accessions, indicating higher levels of allelic diversity. We also compared the alleles within different populations: allelic richness decreased in the order weedy > introduced > landrace > bred > IRRI accessions (Supplementary Table 2). After comparing allelic richness with the respective indices of genetic diversity, we found that allelic richness was significantly associated with the genetic diversity indices: the correlation coefficients (r) between allelic richness and the Shannon–Weaver diversity index, Nei’s gene diversity and PIC were 0.903, 0.748 and 0.560, respectively. Furthermore, the high proportion (82.6%) of rare alleles found in our sample indicated that many informative alleles remain to be mined in the rice collection (Table 2).

To devise plant breeding strategies for crop improvement, a breeder would ideally like to know the relative value of all alleles for genes of interest in the primary germplasm, an unlikely prospect. However, information can be gathered by establishing the allele-mining set (Varshney et al. 2005). So the development of an allele-mining set, which represents the genetic diversity of a crop with minimal redundancy and increases utility of the collection as a whole, is especially important as the funding for germplasm collections decreases (Marita et al. 2000). Many core sets were successfully developed after Frankel proposed the theory of the core set in 1984, but the selection of an appropriate sampling strategy is still important in the construction of a core set. In this paper, we successfully developed an allele-mining set by a heuristic approach with least redundancy in rice. The heuristic method implemented in PowerCore software captured 100% of the allelic diversity existing in the entire collection (Kim et al. 2007). There are now many methods for developing an allele-mining set in different plants; Franco et al. (2006) demonstrated that one of the best strategies is the M (for maximization) strategy developed by Schoen and Brown (1993), maximizing allelic richness at each marker locus. Kim et al. (2007) also found that MSTRAT was the best method compared with the other conventional methods in the rice accessions, but the coverage rate was only 88.9% for SSRs. In this paper, the allele-mining set developed using PowerCore showed the highest diversity and coverage compared to those allele-mining sets developed using other sampling strategies (Table 5). The basis for the development of an allele-mining set using PowerCore is the nominalization of categorical variables, a step efficiently decreasing the number of accessions selected while capturing the maximum variation and minimizing redundancy. 
This efficiency lies in its capability to select entries without comparing relative characteristics among accessions. Instead, it fills all diversity cells with the fewest entries, taking into account all the possible combinations of alleles that exist, through an advanced maximization strategy (M strategy) (Kim et al. 2007).

Developing an allele-mining set has been proposed as a means of using germplasm more economically (Frankel 1984). Brown et al. (1987) recommended that the number of accessions in the core set should account for 5–10% of the base collection, and that the core set should represent at least 70% of the genetic diversity in the base collection. Diwan et al. (1995) indicated that core set sampling should always be greater than 10%. Van Hintum (1995) suggested that the sampling proportion should depend on the particular objective of the core set and should vary between 5 and 20% of the base collection. In establishing a core set of rice germplasm, Li et al. (2000) found that 5% of the base collection represented 96% of the phenotypic variation; Yan et al. (2007) represented approximately 10% of the 18,412 accessions with 88% certainty; and Ebana et al. (2008) established that a 20% core set represented 87.5% of the diversity using SSR markers. The question of the best threshold value for group numbers in an allele-mining set has therefore not yet been fully resolved. The current study showed that the heuristic method implemented in PowerCore software successfully captured all of the alleles existing in the entire collection, with a threshold value of only about 4% giving the highest capturing efficiency (Table 5). Agrama et al. (2009) established that a 12% mini-core represented 100% of the diversity on the basis of 26 phenotypic traits and 70 SSR markers using the same software. From the above results, we found that, in order to retain maximum genetic diversity in the core set, the size of the core set increases correspondingly as the number of SSR markers, and especially the number of alleles, increases. Indeed, the allele-mining set would comprise 208 accessions if all 29 SSRs were used to construct the core set.

The allele-mining set here included 100% of the 482 observed SSR alleles; among them, germplasm from Korea predominated in the allele-mining set (76 entries) due to a large number of germplasm lines (2981 accessions), followed by germplasm from the IRRI (16 entries) where many entries were acquired from the IRRI collection, followed by China having 12 entries. All germplasm types, such as introduced accessions, breeding lines, weedy types, and the Korean landraces, were included in the allele-mining set. The entire accessions originated from 72 countries, but only 29 different geographical origins were represented in the allele-mining set (Table 1); this might be because the definition of the true geographical origin of rice is sometimes difficult due to many human migration events. This could be explained by differences in allelic richness between germplasm from different geographical areas and by the status of the accessions (Supplementary Table 1). For instance, landrace accessions have more alleles than bred accessions. From Table 1 and Supplementary Table 1, we also found no relationship between allelic richness and sample number in the allele-mining set. Some were under or over-represented compared to the total sample; for example, the total number of alleles in IRRI was 205, 16 accessions were sampled from a total of 55 accessions, a high proportion (30.19%) of accessions were kept in the allele-mining set, but the correlation coefficient (r = 0.983) of the mean allele number per sample between the entire and the allele-mining set was significant at the P = 0.01 level, giving a reasonable explanation for the phenomenon. The result was very valuable for sampling in constructing an allele-mining set, because the higher the mean allele number per sample in the group, the more accessions in the allele-mining set. 
In addition, the IRRI accessions are a super mini-set for DNA polymorphism at the IRRI genebank, and the high proportion in the allele-mining set showed indirectly that our heuristic approach is reasonable, feasible and reliable.

In order to assess the robustness of the allele-mining set, the genetic diversity index is used in genetic studies as a convenient measure of both allelic richness and allelic evenness. Although significant correlation coefficients (r = 0.725–0.860) were found between the entire collection and the allele-mining set (Table 3), the total genetic diversity revealed by the Shannon–Weaver diversity index (I) and Nei’s gene diversity (H) was higher in the entire collection than in the allele-mining set, while PIC in the allele-mining set was higher than in the entire collection, owing to the unequal sizes of the two sets. The use of indices such as I and H may therefore sometimes be disputed (Hennink and Zeven 1991). Hennink and Zeven (1991) proposed relative indices, defined as H′ = H_mean/H_max and I′ = I_mean/I_max, respectively. By comparison, we found that H′, I′ and PIC′ (= PIC_mean/PIC_max) in the allele-mining set were similar to those in the entire set (Supplementary Fig. 3), indicating that the H′, I′ and PIC′ indices of genetic diversity are better suited as parameters for evaluating the quality of the allele-mining set.

The properties of starch in rice are very important determinants of rice quality. The method used to build the allele-mining set was validated by a second set of independent markers associated with starch synthesis in rice on a larger sample of accessions. As shown in Fig. 3, the 14 SSR markers related to starch synthesis generally showed lower diversity (allelic richness and all genetic diversity indices) than the 15 neutral SSRs. This could be explained by the fact that SSRs related to starch are probably more conserved than the DNA segments containing neutral SSRs. So, with the lower allelic richness and fewer rare alleles, the use of such a set of SSR markers to validate the method may have diminished the effectiveness of the validation of the allele-mining set (Balfourier et al. 2007). Maximizing the diversity of a first set of markers (15 neutral SSRs) should at the same time maximize useful gene diversity, here expressed by a second set of markers. A complete cross-validation of the method required the total sample of 4046 accessions to be tested with the second set of markers (Table 5). Cunff et al. (2008) considered that estimating the unlinked diversity within the entire collection would have been very tedious; here the allele-mining set represented 70% of the total alleles and 83% of the restricted alleles (alleles with frequencies higher than 0.05%) observed in the 4046 accessions. This meets the accepted standard for an allele-mining set (10% of the base accessions representing more than 70% of the genetic diversity). Moreover, the allele coverage per locus is higher than with other sampling strategies, with the highest efficiency for small cores. The results showed that the allele-mining set based on 15 neutral SSRs minimized redundancy and successfully captured the majority of the target gene alleles. 
Therefore, this heuristic approach can be used as an allele-mining set to uncover the loci with useful alleles and greatly facilitate the identification of useful genes, and can incorporate them into advanced breeding materials for testing and further selection (Tanksley and McCouch 1997).
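The coverage figures quoted above (share of total alleles and share of restricted alleles captured by the subset) can be sketched as follows. This is an illustrative calculation on hypothetical data, not the authors' pipeline. It assumes one allele call per accession and locus with no missing data, and it takes allele frequency as the fraction of accessions carrying the allele. The names `genotypes`, `allele_counts` and `coverage` are introduced here only for illustration.

```python
from collections import Counter

# Hypothetical genotypes: accession -> {locus: allele}.
genotypes = {
    "acc1": {"RM1": "100", "RM2": "200"},
    "acc2": {"RM1": "100", "RM2": "204"},
    "acc3": {"RM1": "104", "RM2": "200"},
    "acc4": {"RM1": "100", "RM2": "200"},
}


def allele_counts(genotypes, accessions):
    """Count how many of the given accessions carry each (locus, allele)."""
    counts = Counter()
    for acc in accessions:
        for locus, allele in genotypes[acc].items():
            counts[(locus, allele)] += 1
    return counts


def coverage(genotypes, subset, min_freq=0.0):
    """Fraction of alleles with frequency > min_freq that the subset captures.

    ASSUMPTION: frequency = carriers / total accessions (no missing calls).
    """
    n = len(genotypes)
    total = allele_counts(genotypes, list(genotypes))
    target = {a for a, c in total.items() if c / n > min_freq}
    if not target:
        return 1.0  # nothing to capture at this threshold
    captured = target & set(allele_counts(genotypes, subset))
    return len(captured) / len(target)


subset = ["acc1", "acc2"]
print(coverage(genotypes, subset))                 # all alleles
print(coverage(genotypes, subset, min_freq=0.25))  # restricted alleles only
```

With the threshold at zero every observed allele counts toward the total; raising `min_freq` restricts the comparison to the more common alleles, in the spirit of the 0.05% cut-off used in the text.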

In conclusion, an allele-mining set of 162 accessions (only about 4% of the entire collection) was successfully developed by a heuristic approach based on SSR markers using PowerCore software. This allele-mining set captured all of the alleles present in the entire collection and will be useful to breeders in future rice gene-mining studies aimed at introducing unused useful alleles into elite rice varieties. Moreover, the methodology presented here for developing an allele-mining set with the least allelic redundancy and maximum allelic diversity from a large rice germplasm collection can be applied to other crop species in the post-genomic era.
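The idea of selecting a small set of accessions that retains every allele can be illustrated with a simple greedy set-cover pass. This is explicitly not the PowerCore algorithm, whose heuristic search is described in Kim et al. (2007); it is a swapped-in textbook technique on hypothetical data, shown only to make the selection problem concrete. The function name `greedy_allele_mining_set` is an assumption introduced for illustration.

```python
# Hypothetical genotypes: accession -> {locus: allele}.
genotypes = {
    "acc1": {"RM1": "100", "RM2": "200"},
    "acc2": {"RM1": "100", "RM2": "204"},
    "acc3": {"RM1": "104", "RM2": "200"},
    "acc4": {"RM1": "100", "RM2": "200"},
}


def greedy_allele_mining_set(genotypes):
    """Pick accessions until every (locus, allele) pair is covered.

    Greedy set cover: at each step take the accession that adds the most
    alleles not yet covered (ties broken by accession name).
    """
    allele_sets = {acc: set(calls.items()) for acc, calls in genotypes.items()}
    uncovered = set().union(*allele_sets.values())
    chosen = []
    while uncovered:
        best = max(sorted(allele_sets),
                   key=lambda acc: len(allele_sets[acc] & uncovered))
        chosen.append(best)
        uncovered -= allele_sets[best]
    return chosen


selected = greedy_allele_mining_set(genotypes)
print(selected, f"{len(selected) / len(genotypes):.0%} of the collection")
```

A greedy pass always reaches 100% allele coverage but is not guaranteed to return the smallest possible set. In this toy data it picks three accessions, although acc2 and acc3 alone cover all four alleles, which is one reason a more careful heuristic search is attractive when redundancy must be minimized.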

References

Agrama HA, Yan WG, Lee F, Fjellstrom R, Chen MH, Jia M, McClung A (2009) Genetic assessment of a mini-core subset developed from the USDA Rice Genebank. Crop Sci 49:1336–1346
Balfourier F, Roussel V, Strelchenko P, Exbrayat-Vinson F, Sourdille P, Boutet G, Koenig J, Ravel C, Mitrofanova O, Beckert M, Charmet G (2007) A worldwide bread wheat core collection arrayed in a 384-well plate. Theor Appl Genet 114:1265–1275
Bataillon TL, David JL, Schoen DJ (1996) Neutral genetic markers and conservation genetics: simulated germplasm collections. Genetics 144:409–417
Botstein D, White RL, Skolnick M, Davis RW (1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314–331
Bretting PK, Widrlechner MP (1995) Genetic markers and plant genetic resource management. Plant Breed Rev 31:11–86
Brown AHD (1989) Core collections: a practical approach to genetic resources management. Genome 31:818–824
Brown AHD (1995) The core collection at the crossroads. In: Hodgkin T, Brown AHD, van Hintum TJL (eds) Core collections of plant genetic resources. Wiley, Chichester, pp 3–19
Brown AHD, Grace JP, Speer SS (1987) Designation of a core collection of perennial glycine. Soybean Genet Newsletter 14:59–70
Chandra S, Huaman Z, Hari Krishna S, Ortiz R (2002) Optimal sampling strategy and core collection size of Andean tetraploid potato based on isozyme data—a simulation study. Theor Appl Genet 104:1325–1334
Charmet G, Balfourier F (1995) The use of geostatistics for sampling a core collection of perennial ryegrass population. Genet Resour Crop Evol 42:303–309
Chavarriaga-Aguirre P, Maya MM, Tohme J, Duque MC, Iglesias C, Bonierbale MW, Kresovich S, Kochert G (1999) Using microsatellites, isozymes and AFLPs to evaluate genetic diversity and redundancy in the cassava core collection and to assess the usefulness of DNA-based markers to maintain germplasm collections. Mol Breed 5:263–273
Collard BCY, Cruz CMV, McNally KL, Virk PS, Mackill DJ (2008) Rice molecular breeding laboratories in the genomics era: current status and future considerations. Int J Plant Genomics 2008:1–25
Crossa J, Basford K, Taba S, DeLacy I, Silva E (1995) Three-mode analysis of maize using morphological and agronomic attributes measured in multilocation trials. Crop Sci 35:1483–1491
Cunff LL, Fournier-Level A, Laucou V, Vezzulli S, Lacombe T, Adam-Blondon AF, Boursiquot JM, Patrice T (2008) Construction of nested genetic core collections to optimize the exploitation of natural diversity in Vitis vinifera L. subsp. sativa. BMC Plant Biol 8:31
Diwan N, McIntosh MS, Bauchan GR (1995) Methods of developing a core collection of annual Medicago species. Theor Appl Genet 90:755–761
Ebana K, Kojima Y, Fukuoka S, Nagamine T, Kawase M (2008) Development of mini core collection of Japanese rice landrace. Breed Sci 58:281–291
Franco J, Crossa J, Taba S, Shands H (2003) A multivariate method for classifying cultivars and studying group × environment × trait interaction. Crop Sci 43:1249–1258
Franco J, Crossa J, Taba S, Shands H (2005) A sampling strategy for conserving genetic diversity when forming core subsets. Crop Sci 45:1035–1044
Franco J, Crossa J, Warburton ML, Taba S (2006) Sampling strategies for conserving maize diversity when forming core subsets using genetic markers. Crop Sci 46:854–864
Frankel OH (1984) Genetic perspectives of germplasm conservation. In: Arber W, Illmensee K, Peacock WJ, Starlinger P (eds) Genetic manipulation: impact on man and society. Cambridge University Press, Cambridge, pp 161–171
Frankel OH, Brown AHD (1984) Plant genetic resources today: a critical appraisal. In: Holden JHW, Williams JT (eds) Crop genetic resources: conservation and evaluation. Allen & Unwin Ltd, London, pp 249–257
Garris AJ, Tai TH, Coburn J, Kresovich S, McCouch S (2005) Genetic structure and diversity in Oryza sativa L. Genetics 169:1631–1638
Gouesnard B, Bataillon TM, Decoux G, Rozale C, Schoen DJ, David JL (2001) Mstrat: an algorithm for building germplasm core collections by maximizing allelic or phenotypic richness. J Hered 92:93–94
Hennink S, Zeven AC (1991) The interpretation of Nei and Shannon-Weaver within population variation indices. Euphytica 51:235–240
Hokanson SC, Szewc-McFadden AK, Lamboy WF, McFerson JR (1998) Microsatellite (SSR) markers reveal genetic identities, genetic diversity and relationships in a Malus domestica Borkh. core subset collection. Theor Appl Genet 97:671–683
Holden JHW (1984) The second ten years. In: Holden JHW, Williams JT (eds) Crop genetic resources: conservation and evaluation. Allen and Unwin, Winchester, pp 277–285
Hu J, Zhu J, Xu HM (2000) Methods of constructing core collections by stepwise clustering with three sampling strategies based on the genotypic values of crops. Theor Appl Genet 101:264–268
Huaman Z, Aguilar C, Ortiz R (1999) Selecting a Peruvian sweet potato core collection on the basis of morphological, ecogeographical, and disease and pest reaction data. Theor Appl Genet 98:840–844
Jansen J, van Hintum ThJL (2007) Genetic distance sampling: a novel sampling method for obtaining core collections using genetic distances with an application to cultivated lettuce. Theor Appl Genet 114:421–428
Joe T, Orlando GD (1996) AFLP analysis of gene pools of a wild bean core collection. Crop Sci 36:1375–1384
Kim KW, Chung HK, Cho GT, Ma KH, Chandrabalan D, Gwag JG, Kim TS, Cho EG, Park YJ (2007) PowerCore: a program applying the advanced M strategy with a heuristic search for establishing allele mining sets. Bioinformatics 23:2155–2162
Kresovich S, Luongo AJ, Schloss SJ (2002) Mining the gold: finding allelic variants for improved crop conservation and use. In: Engels JMM, Rao VR, Brown AHD, Jackson MT (eds) Managing plant genetic diversity. CABI, Wallingford, pp 379–386
Latha R, Rubia L, Bennett J, Swaminathan MS (2004) Allele mining for stress tolerance genes in Oryza species and related germplasm. Mol Biotechnol 27:101–108
Li ZC, Zhang HL, Zeng YW, Yang ZY, Shen SQ, Sun CQ, Wang XK (2000) Study on sampling schemes of core collection of local varieties of rice in Yunnan, China. Sci Agri Sin 33:1–7
Li CT, Shi CH, Wu JG, Xu HM, Zhang HZ, Ren YL (2004) Methods of developing core collections based on the predicted genotypic value of rice (Oryza sativa L.). Theor Appl Genet 108:1172–1176
Liu K, Muse SV (2005) PowerMarker: an integrated analysis environment for genetic marker analysis. Bioinformatics 21:2128–2129
Malysheva-Otto LV, Ganal MW, Röder MS (2006) Analysis of molecular diversity, population structure and linkage disequilibrium in a worldwide survey of cultivated barley germplasm (Hordeum vulgare L.). BMC Genet 7:6
Marita JM, Rodriguez JM, Nienhuis J (2000) Development of an algorithm identifying maximally diverse core collections. Genet Resour Crop Evol 47:515–526
McKhann HI, Camilleri C, Berard A, Bataillon T, David JL, Reboud X, Corre VL, Caloustian C, Gut IG, Brunel D (2004) Nested core collections maximizing genetic diversity in Arabidopsis thaliana. Plant J 38:193–202
Mosjidis JA, Klingler KA (2006) Genetic diversity in the core subset of the US red clover germplasm. Crop Sci 46:758–762
Nei M (1973) Analysis of gene diversity in subdivided populations. Proc Natl Acad Sci USA 70:3321–3323
Ortiz R, Ruiz-Tapia EN, Mujica-Sanchez A (1998) Sampling strategy for a core collection of Peruvian quinoa germplasm. Theor Appl Genet 96:475–483
Parsons BJ, Newbury HJ, Jackson MT, Ford-Lloyd BV (1999) The genetic structure and conservation of aus, aman and boro rices from Bangladesh. Genet Resour Crop Evol 46:587–598
Peeters JP, Martinelli JA (1989) Hierarchical cluster analyses as a tool to manage variation in germplasm collections. Theor Appl Genet 78:42–48
Perry MC, McIntosh MS, Stoner AK (1991) Geographical patterns of variation in the USDA soybean germplasm collection: II. allozyme frequencies. Crop Sci 31:1356–1360
Qiu LJ, Cao YS, Chang RZ, Zhou XA, Wang GX, Sun JY, Xie H, Zhang B, Li XH, Xu ZY (2003) Establishment of Chinese soybean (G. max) core collection. I. Sampling strategy. Sci Agri Sin 36:1442–1449
Rincon F, Johnson B, Crossa J, Taba S (1996) Cluster analysis, an approach to sampling variability in maize accessions. Maydica 41:307–316
Rogers JS (1972) Measures of genetic similarity and genetic distance. Stud Genet VII Univ Tex Publ 7213:145–153
Ronfort J, Bataillon T, Santoni S, Delalande M, David JL, Prosperi JM (2006) Microsatellite diversity and broad scale geographic structure in a model legume: building a set of nested core collection for studying naturally occurring variation in Medicago truncatula. BMC Plant Biol 6:28
Schoen DJ, Brown AHD (1993) Conservation of allelic richness in wild crop relatives is aided by assessment of genetic markers. Proc Natl Acad Sci USA 90:10623–10627
Schuelke M (2000) An economic method for the fluorescent labeling of PCR fragments. Nat Biotechnol 18:233–234
Shannon CE, Weaver W (1949) The mathematical theory of communication. University of Illinois Press, Urbana
Szalma SJ, Buckler ES, Snook ME, McMullen MD (2005) Association analysis of candidate genes for maysin and chlorogenic acid accumulation in maize silks. Theor Appl Genet 110:1324–1333
Tanksley SD, McCouch SR (1997) Seed bank and molecular maps: unlocking genetic potential from the wild. Science 277:1063–1066
Upadhyaya HD, Ortiz R (2001) A mini core subset for capturing diversity and promoting utilization of chickpea genetic resources in crop improvement. Theor Appl Genet 102:1292–1298
Upadhyaya HD, Gowda CLL, Pundir RPS, Reddy VG, Singh S (2006) Development of core subset of finger millet germplasm using geographical origin and data on 14 quantitative traits. Genet Resour Crop Evol 53:679–685
van Hintum TJL (1995) Hierarchical approaches to the analysis of genetic diversity in crop plants. In: Hodgkin T, Brown AHD, van Hintum TJL (eds) Core collections of plant genetic resources. Wiley, Chichester, pp 23–34
van Hintum ThJL, van Treuren R (2002) Molecular markers: tools to improve genebank efficiency. Cell Mol Biol Lett 7:737–744
van Hintum ThJL, von Bothmer R, Visser DL (1995) Sampling strategies for composing a core collection of cultivated barley (Hordeum vulgare s. lat.) collected in China. Hereditas 122:7–15
Varshney RK, Andreas GA, Sorrells ME (2005) Genomics-assisted breeding for crop improvement. Trends Plant Sci 10:621–630
Wang JC, Hu J, Xu HM, Zhang S (2007) A strategy on constructing core collections by least distance stepwise sampling. Theor Appl Genet 115:1–8
Yan WG, Ruter JN, Bryant RJ, Bockelman HE, Fjellstrom RG, Chen MH, Tai TH, McClung AM (2007) Development and evaluation of a core subset of the USDA rice germplasm collection. Crop Sci 47:869–876
Yu J, Hu S, Wang J, Wong GK, Li S et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296:79–92
Zewdie Y, Tong NK, Bosland P (2004) Establishing a core collection of capsicum using a cluster analysis with enlightened selection of accessions. Genet Resour Crop Evol 51:147–151

Acknowledgments

This study was supported by the Biogreen 21 project (Grant 20080401034058) of the Rural Development Administration (RDA) and by a grant (Code 200803101010415) from the National Academy of Agricultural Science, RDA, Republic of Korea. This research was also supported by the 2008 KU Brain Pool of Konkuk University for Dr. Zhao Weiguo.

Author information

Authors and Affiliations

Department of Plant Resources, College of Industrial Science, Kongju National University, Yesan, 340-802, Republic of Korea
Weiguo Zhao, Jong-Wook Chung & Yong-Jin Park
Institute of Resource Sciences, Kongju National University, Yesan, 340-702, Republic of Korea
Weiguo Zhao, Jong-Wook Chung & Yong-Jin Park
Jiangsu University of Science and Technology, Sericultural Research Institute, Chinese Academy of Agricultural Sciences, 212018, Zhenjiang, Jiangsu, China
Weiguo Zhao
National Academy of Agricultural Science, RDA, 249, Suwon, 441-707, Republic of Korea
Gyu-Taek Cho, Kyung-Ho Ma & Jae-Gyun Gwag

Authors

Weiguo Zhao
Gyu-Taek Cho
Kyung-Ho Ma
Jong-Wook Chung
Jae-Gyun Gwag
Yong-Jin Park

Corresponding author

Correspondence to Yong-Jin Park.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (XLS 47 kb)

Supplementary material 2 (XLS 25 kb)

Supplementary material 3 (PPT 126 kb)

Supplementary material 4 (PPT 425 kb)

Supplementary material 5 (PPT 104 kb)

About this article

Cite this article

Zhao, W., Cho, GT., Ma, KH. et al. Development of an allele-mining set in rice using a heuristic algorithm and SSR genotype data with least redundancy for the post-genomic era. Mol Breeding 26, 639–651 (2010). https://doi.org/10.1007/s11032-010-9400-x

Received: 28 April 2009
Accepted: 25 January 2010
Published: 16 February 2010
Issue Date: December 2010
DOI: https://doi.org/10.1007/s11032-010-9400-x

Introduction