Introduction

The concept of the core collection was proposed by Frankel (1984). A core collection is defined as a representative sample of the whole collection with minimum repetitiveness and maximum genetic diversity of a crop species and its relatives (Frankel and Brown 1984a, b; Brown 1989). The core collection serves as a working collection that can be evaluated and utilized preferentially, which addresses the problem that the large size of a collection hinders the preservation and utilization of germplasm resources. The core collection is a convenient way to study and utilize germplasm resources and has received extensive attention worldwide.

Cluster analysis has been widely used as an important tool to group accessions for constructing core collections (Hintum 1995; Zhang et al. 2004). For example, cluster analysis was used to separate similar accessions to establish the chickpea core collection and the chickpea mini core subset (Upadhyaya and Ortiz 2001). Zewdie et al. (2004) used cluster analysis to classify accessions of capsicum based on morphological trait data, and then established the capsicum core collection by three sampling methods based on the results of the cluster analysis. Cluster analysis has also been adopted for grouping data based on molecular markers in some studies of core collections (Baranger et al. 2004; Chabane and Valkoun 2004). However, several cluster methods can be chosen for clustering. Zhang et al. (2000) compared eight cluster methods when studying the construction of a sesame core collection, and found that Ward’s method was the most feasible. Some studies suggested that cluster methods should be combined with corresponding sampling methods when constructing core collections (Hu et al. 2000c; Li et al. 2004).

There are different strategies for sampling core accessions, such as the random strategy, constant strategy, proportional strategy, logarithmic strategy and genetic diversity-dependent strategy (Brown 1989; Yonezawa et al. 1995). Hu et al. (2000a, b, c) suggested three sampling methods to select core accessions by stepwise clusters, which could construct more reliable core collections because stepwise clustering avoids unequal subgroup sizes and unsymmetrical sampling. However, most of those strategies are based on cluster analysis and random sampling, and the resulting core collections are greatly affected by the cluster method. Therefore, before constructing core collections, much work needs to be done to find an appropriate cluster method. The present paper proposes a strategy for constructing more reliable core collections based on least distance stepwise sampling, which does not require the cluster method to be considered. The genetic diversity of core collections constructed by this method was evaluated to assess the validity of the method. Optimal parameters for constructing core collections based on this strategy were selected by simulations.

Materials and methods

Materials

An initial collection of 1,547 cotton genotypes served to construct core collections. All 1,547 genotypes were planted for 2 years with two replications per year. Data on 18 quantitative traits were recorded in the initial collection: nine agronomic traits (plant height, height of fruit branch, length of fruiting node, length of boll stalk, number of fruiting branches per plant, bolls per plant, growth period, boll weight and lint percentage), five fiber traits (length, uniformity, strength, elongation and micronaire) and four seed traits (seed length, seed width, ratio of length to width and kernel weight).

Genetic models and statistical methods

In the genetic experiments for evaluating germplasm resources in a single environment with at least two replications, the observed values could be expressed as \(Y_{k(ij)} = \mu + R_{i} + C_{j} + G_{k(ij)} + \varepsilon_{k(ij)},\) where \(\mu\) is the population mean; \(R_{i}\) is the fixed effect of the ith row; \(C_{j}\) is the fixed effect of the jth column; \(G_{k(ij)}\) is the random effect of the kth genotype within the ith row and the jth column, \(G_{k(ij)} \sim (0, \sigma_{G}^{2})\); \(\varepsilon_{k(ij)}\) is the residual effect, \(\varepsilon_{k(ij)} \sim (0, \sigma_{\varepsilon}^{2})\). In the complicated genetic experiments, which are conducted for multiple environments with at least two replications per environment, the observed values could be expressed as \(Y_{hk(ij)} = \mu + E_{h} + R_{i(h)} + C_{j(h)} + G_{k(ij)} + GE_{hk(ij)} + \varepsilon_{hk(ij)},\) where \(\mu\) is the population mean; \(E_{h}\) is the fixed effect of the hth environment; \(R_{i(h)}\) is the fixed effect of the ith row within the hth environment; \(C_{j(h)}\) is the fixed effect of the jth column within the hth environment; \(G_{k(ij)}\) is the random effect of the kth genotype within the ith row and the jth column, \(G_{k(ij)} \sim (0, \sigma_{G}^{2})\); \(GE_{hk(ij)}\) is the random effect of the interaction between the hth environment and the kth genotype, \(GE_{hk(ij)} \sim (0, \sigma_{GE}^{2})\); \(\varepsilon_{hk(ij)}\) is the residual effect, \(\varepsilon_{hk(ij)} \sim (0, \sigma_{\varepsilon}^{2})\).

Minimum norm quadratic unbiased estimation (MINQUE) method could be used to estimate the variance component of the genotypic effect, and the genotypic value of each accession could be unbiasedly predicted by adjusted unbiased prediction (AUP) method based on the variance component of the genotypic effect (Zhu 1993; Zhu and Weir 1996).

Constructing core collections

  1. Constructing core collections by the least distance stepwise sampling (LDSS) strategy: first, the sampling percentage of the core collection relative to the initial collection is fixed on the basis of previous studies. Next, the genetic distances between accessions are calculated and the accessions are grouped by hierarchical cluster analysis based on these distances. From the subgroup with the least distance (this subgroup is unique in the whole dendrogram), one accession is randomly removed and the other accession of the subgroup is sampled. Then, the genetic distances among the remaining accessions are calculated again, and the sampling is performed in the same way. The stepwise sampling is repeated until the percentage of remaining accessions reaches the given sampling percentage. In this way, a core collection is constructed.

  2. For comparison, core collections were also constructed by the stepwise clusters with random sampling (SCR) strategy (Hu et al. 2000c). The process is as follows: first, the genetic distances among accessions of the initial collection are calculated. Next, the accessions are grouped by hierarchical cluster analysis. One accession is randomly sampled from each subgroup with two accessions at the lowest level of the dendrogram. Then, the genetic distances among the remaining accessions are calculated again and used for the next round of clustering, and the sampling is performed in the same way. The stepwise clustering is repeated until the size of the remaining collection reaches 20–30% (Yonezawa et al. 1995) of the initial collection. Thus, a core collection is constructed.
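The LDSS procedure in item 1 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' MATLAB program: the names `ldss`, `euclid`, `sampling_fraction` and `seed` are hypothetical, and the sketch assumes that the least-distance subgroup of the dendrogram is simply the closest pair of remaining accessions, so no dendrogram is built explicitly. Distances are recomputed at every step, as in the description above.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two trait vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ldss(values, sampling_fraction, dist=euclid, seed=0):
    """Return indices of the accessions kept by least distance stepwise sampling.

    values: list of trait vectors, one per accession (assumed standardized).
    sampling_fraction: core size as a fraction of the initial collection.
    """
    rng = random.Random(seed)
    remaining = list(range(len(values)))
    target = max(1, round(sampling_fraction * len(values)))
    while len(remaining) > target:
        # Locate the closest pair among the remaining accessions
        # (assumed here to be the least-distance subgroup).
        best = None
        for pos, i in enumerate(remaining):
            for j in remaining[pos + 1:]:
                d = dist(values[i], values[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Randomly remove one member of the pair; the other one is kept.
        remaining.remove(rng.choice((i, j)))
    return remaining


# Toy data: four tight pairs of accessions described by a single trait.
traits = [[0.0], [0.1], [5.0], [5.2], [10.0], [10.3], [20.0], [20.4]]
core = ldss(traits, 0.5)
```

With this toy input the sketch keeps exactly one accession from each tight pair, whichever random choices are made. Each step scans all pairs, so the cost grows quickly with the number of accessions.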

Four distances (city block distance, Cityblock; Euclidean distance, Euclid; standardized Euclidean distance, Seuclid; Mahalanobis distance, Mahal) were used to assess genetic distances among accessions. Four hierarchical cluster methods (nearest distance method, Single; furthest distance method, Complete; unweighted pair-group average method, Average; and Ward’s method, Ward) were used for clustering, each in combination with the four genetic distances, to construct different core collections.

The evaluating parameters for core collection

The representativeness of a core collection could be evaluated by mean, variance, range and coefficient of variation. A homogeneity test (F test) for variances and a t test for means (α = 0.05) can be performed to determine the difference in traits between the core collection and the initial collection (Hu et al. 2000c). Based on the results of the t test, the F test, the range and the coefficient of variation, four further evaluating parameters are calculated. These are the mean difference percentage (MD), variance difference percentage (VD), coincidence rate of range (CR) and variable rate of coefficient of variation (VR) (Hu et al. 2000b, c). The four parameters are formulated as follows:

  • \( \text{MD} = \frac{S_{t}}{n} \times 100, \) where \(S_{t}\) is the number of traits which have significant difference (α = 0.05) of means between the initial collection and core collection; \(n\) is total number of traits.

  • \( \text{VD} = \frac{S_{F}}{n} \times 100, \) where \(S_{F}\) is the number of traits which have significant difference (α = 0.05) of variances between the initial collection and core collection; \(n\) is total number of traits.

  • \( \text{CR} = \frac{1}{n} \sum\limits_{i = 1}^{n} \frac{R_{\text{C}(i)}}{R_{\text{I}(i)}} \times 100, \) where \(R_{\text{C}(i)}\) is the range of the ith trait of core collection; \(R_{\text{I}(i)}\) is the range of the corresponding trait of the initial collection; \(n\) is total number of traits.

  • \( \text{VR} = \frac{1}{n} \sum\limits_{i = 1}^{n} \frac{\text{CV}_{\text{C}(i)}}{\text{CV}_{\text{I}(i)}} \times 100, \) where \(\text{CV}_{\text{C}(i)}\) is the coefficient of variation of the ith trait of core collection; \(\text{CV}_{\text{I}(i)}\) is the coefficient of variation of the corresponding trait of the initial collection; \(n\) is total number of traits.

The core collection can be considered to represent the genetic diversity of the initial collection if MD ≤ 20% and CR ≥ 80% at the same time (Hu et al. 2000c). Moreover, at the same sampling percentage, a smaller MD indicates a more representative core collection, and core collections with larger CR or VR are more representative.
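As a small worked illustration of the range-based and CV-based parameters, the sketch below computes CR and VR for toy data and applies the MD ≤ 20% and CR ≥ 80% rule. The function names and the data are hypothetical. MD and VD are not computed because they require the t and F tests; MD is passed in as a given number. The sketch assumes the coefficient of variation is taken on values with a nonzero mean, using the sample standard deviation.

```python
import statistics


def _cv(xs):
    """Coefficient of variation: sample standard deviation over the mean."""
    return statistics.stdev(xs) / statistics.mean(xs)


def coincidence_rate(core, initial):
    """CR: average over traits of range(core) / range(initial), times 100."""
    ratios = [(max(c) - min(c)) / (max(t) - min(t)) for c, t in zip(core, initial)]
    return 100 * sum(ratios) / len(ratios)


def variable_rate(core, initial):
    """VR: average over traits of CV(core) / CV(initial), times 100."""
    ratios = [_cv(c) / _cv(t) for c, t in zip(core, initial)]
    return 100 * sum(ratios) / len(ratios)


def is_representative(md, cr):
    """Rule quoted in the text: MD <= 20% and CR >= 80% at the same time."""
    return md <= 20 and cr >= 80


# Toy data: each inner list holds one trait measured across accessions.
initial = [[1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]]
core = [[1.0, 3.0, 5.0], [10.0, 30.0, 40.0]]

cr = coincidence_rate(core, initial)   # (4/4 + 30/40) / 2 * 100 = 87.5
vr = variable_rate(core, initial)
```

In this toy case the core keeps the full range of the first trait and three quarters of the range of the second, so CR is 87.5%.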

Simulations

To obtain consistent, stable and reproducible results, repeated sampling (bootstrap) was conducted (Chandra et al. 2002). Four hundred and twelve genotypes from the same growing region were selected from the 1,547 cotton genotypes to perform the simulations. For each combination (a sampling percentage combined with a genetic distance and a cluster method), k = 1,000, 1,500 and 2,000 independent random samples were drawn from the initial collection. In each sample, the core collection was constructed and the four evaluating parameters above were calculated. Therefore, each combination generated four resampling populations of evaluating parameters for each of k = 1,000, 1,500 and 2,000. Based on the simulation results, normality tests were performed and the mean, median, upper-0.025-quantile and upper-0.975-quantile were calculated for each population.
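The summary step for one resampling population can be sketched as follows. This is only an illustration under stated assumptions: the function name `summarize` is hypothetical, the two quantiles are taken as the empirical 0.025 and 0.975 quantiles by a nearest-rank rule, and how each random sample was drawn is left outside the sketch because the text does not specify it.

```python
import statistics


def summarize(values):
    """Return (median, 0.025 quantile, 0.975 quantile) of one resampling population.

    Quantiles use a nearest-rank rule on the sorted values; this is an
    assumption, since the text does not state which estimator was used.
    """
    ordered = sorted(values)
    k = len(ordered)
    lower = ordered[round(0.025 * (k - 1))]
    upper = ordered[round(0.975 * (k - 1))]
    return statistics.median(ordered), lower, upper


# Toy population: the integers 0..1000 stand in for 1,001 simulated CR values.
median, lower, upper = summarize(list(range(1000, -1, -1)))
```

The median is reported alongside the two quantiles because it stays a sensible central estimate when the resampling population is not normally distributed.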

The validation of core collections

Four hundred and twelve genotypes from the same growing region were subjected to principal components analysis to validate the core collections. The distribution of the core accessions and the reserve accessions was plotted on the first two principal components at sampling percentages of 10 and 30%.

Data management

Before constructing core collections, genotypic values of each trait were standardized (μ = 0, σ = 1; where μ is the population mean of the trait and σ is the standard deviation of the trait). Normality tests were performed using Univariate procedure in SAS software (version 8.01). Other experiments were conducted in MATLAB software (version 6.5).
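The standardization step can be illustrated with a short sketch. The function name `standardize` is hypothetical, and the use of the sample standard deviation (divisor n − 1) is an assumption, since the text does not say which estimator of σ was used.

```python
import statistics


def standardize(trait_values):
    """Rescale one trait to mean 0 and standard deviation 1."""
    mu = statistics.mean(trait_values)
    sigma = statistics.stdev(trait_values)  # assumption: sample standard deviation
    return [(v - mu) / sigma for v in trait_values]


# Toy genotypic values of a single trait.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Standardizing each trait in this way keeps traits measured on large scales from dominating distances such as Cityblock and Euclid.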

Results

Comparison between LDSS strategy and SCR strategy in the same sampling percentage

When core collections were constructed from the 1,547 cotton genotypes at the same sampling percentage, the LDSS strategy required far more cluster times than the SCR strategy; for LDSS, cluster times = the number of initial accessions − the number of core accessions (Table 1). All MDs of the core collections constructed by the two strategies were 0%, and all CRs were over 90%. Most VDs of the core collections constructed by the two strategies were 0%. In the same combination, the CR and VR of core collections constructed by the LDSS strategy were larger than those of core collections constructed by the SCR strategy (Table 1).

Table 1 Comparison between core collections constructed by least distance stepwise sampling strategy and stepwise clusters with random sampling strategy in the same sampling percentage

The representation of LDSS strategy in different cluster methods

When the same sampling percentage and genetic distance were used to construct core collections by the LDSS strategy, the four cluster methods produced the same value for each evaluating parameter. A comparison of the accessions in each core collection showed that the four core collections with the same sampling percentage and genetic distance were composed of exactly the same accessions, whether based on simulated data or true data. Table 2 shows the representativeness of the LDSS strategy based on true data.

Table 2 Changes of evaluating parameters of core collections constructed by least distance stepwise sampling strategy with four genetic distances and four cluster methods in three sampling percentages

Comparison of genetic distances for LDSS strategy by simulation

The results were similar for k = 1,000, 1,500 and 2,000. Therefore, only the results for k = 1,000 are listed and discussed. Normality tests showed that none of the resampling populations of evaluating parameters was normally distributed. Therefore, the median rather than the mean was taken as the estimate, and the upper-0.025-quantile and upper-0.975-quantile formed the confidence interval of the evaluating parameter at the significance level of 0.05. Since all cluster methods generated the same core collections under the same sampling percentage and genetic distance, only the simulation results for core collections constructed by the single-cluster method combined with the four genetic distances are listed in the present paper (Table 3). Except for Seuclid at the 10% sampling percentage, all medians of MD were 0% and all medians of CR were over 85% (Table 3). Except for Mahal at the sampling percentage of 10%, all genetic distances had zero upper-0.025-quantile and the same upper-0.975-quantile of MD at all three sampling percentages (Table 3). Compared with Cityblock and Euclid, Mahal and Seuclid generated larger medians, upper-0.025-quantiles and upper-0.975-quantiles of VD, CR and VR at the same sampling percentage. Mahal generated slightly larger medians, upper-0.025-quantiles and upper-0.975-quantiles of CR than Seuclid at the same sampling percentage, while the corresponding values of VR were larger for Seuclid than for Mahal, especially at small sampling percentages. Cityblock generated larger medians, upper-0.025-quantiles and upper-0.975-quantiles of CR and VR than Euclid at the same sampling percentage (Table 3). The changes of VD for Cityblock and Euclid were similar (Table 3).

Table 3 Simulations on core collections constructed by least distance stepwise sampling strategy with single-cluster method combining with four genetic distances for 1,000 independent random samples

Validation of core collections by the principal components analysis

The above results suggested that core collections constructed by the LDSS strategy were more representative than those constructed by the SCR strategy, and that Seuclid was more suitable than Cityblock, Euclid and Mahal for constructing core collections based on the LDSS strategy. Principal component analysis was conducted to further validate the core collections constructed with Seuclid based on the LDSS strategy. Core accessions were selected symmetrically throughout the whole collection at both sampling percentages (Fig. 1). Most extreme accessions were selected at the 10% sampling percentage, and almost all of them were selected at the 30% sampling percentage (Fig. 1). The plots illustrated that the genetic diversity of the initial collection was organized to some degree. With the LDSS strategy, only one accession was selected from each region of similar accessions, which avoided redundancy efficiently (Fig. 1).

Fig. 1 Principal component plots of core accessions and reserve accessions in the sampling percentage of 10 and 30%. Core collections were constructed by least distance stepwise sampling (LDSS) strategy based on Seuclid genetic distance combining with single-cluster method

Discussion

In research on constructing core collections, phenotypic values are mainly used (Zhang et al. 2000; Fundora et al. 2004; Volk et al. 2005). Field experiments are required to obtain phenotypic values of germplasm materials. Most traits of germplasm materials are quantitative traits under the control of polygenes, which means that they are easily affected by field conditions and experimental errors. Moreover, the effects of interaction between gene and environment (GE effects) are contained in phenotypic values (Hu et al. 2000c). Therefore, stratification based on phenotypic values could not essentially reflect the genetic relationship among accessions, and a core collection based on phenotypic values may not accurately represent the genetic diversity of the initial collection (Tanksley and McCouch 1997). Genotypic values can be predicted from phenotypic values by the mixed linear model approach, which eliminates experimental errors, environmental effects and GE effects. Stratification based on genotypic values can reflect the genetic relationship among accessions more accurately. Therefore, a core collection constructed from genotypic values will be more representative than one constructed from phenotypic values (Hu et al. 2000c). Two types of mixed linear model were introduced in the present paper. One is suitable for analyzing experimental data from a single environment; the other is suitable for analyzing experimental data from multiple environments. When a genetic experiment is performed in multiple environments, the environmental effects and GE effects can be separated from the observed values by the mixed linear model described above, which leads to more precise predictions of genotypic effects than in a single environment. Therefore, performing genetic experiments in multiple environments will give more accurate results in constructing core collections. In the present research, core collections were constructed based on genotypic values from a genetic experiment in multiple environments.

The genetic diversity of a collection is not randomly dispersed but may be organized to varying degrees (Balakrishnan et al. 2000); the principal components analysis in the present research supported this. Accessions from the same growing region are more similar than those from different growing regions. A population consisting of accessions with small genetic differences is more efficient for investigating the validity of different constructing strategies. The present results showed that a population of 412 genotypes was adequate for evaluating different genetic distances. Moreover, the running time of the simulation program for the LDSS strategy became prohibitively long when the number of accessions exceeded 500, even on high-powered computers. Therefore, 412 genotypes from the same growing region were used in the present research.

Both the SCR strategy and the LDSS strategy are based on hierarchical clustering. When the SCR strategy is used to construct core collections, each sampling step is performed in all subgroups at the lowest level of the dendrogram, and redundant accessions in these subgroups are removed. Different cluster methods generate different subgroups at the lowest level of the dendrogram. However, the subgroup with the least distance is unique in the dendrogram, and all commonly used hierarchical cluster methods (nearest distance method, furthest distance method, centroid method, unweighted pair-group average method, weighted pair-group average method and Ward’s method) generate the same least-distance subgroup (Yang et al. 1989). The LDSS strategy performs sampling in the subgroup with the least distance of the dendrogram in each step of stepwise sampling. Therefore, given the same random sampling order, all hierarchical cluster methods will construct core collections with the same accessions.
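The reason the first merge does not depend on the cluster method can be checked with a small sketch. At the first step every cluster is a single accession, so the single, complete and average linkage criteria all reduce to the pairwise distance, and the Ward criterion reduces to half the squared distance, which has the same minimizer. The function names and the toy points below are hypothetical, and Euclidean distance is assumed.

```python
import math
from itertools import combinations


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def single(a, b):
    """Nearest distance between members of clusters a and b."""
    return min(euclid(p, q) for p in a for q in b)


def complete(a, b):
    """Furthest distance between members of clusters a and b."""
    return max(euclid(p, q) for p in a for q in b)


def average(a, b):
    """Unweighted mean of all between-cluster distances."""
    return sum(euclid(p, q) for p in a for q in b) / (len(a) * len(b))


def ward(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * euclid(centroid(a), centroid(b)) ** 2


def first_merge(points, criterion):
    """Pair of singleton clusters that the given criterion merges first."""
    pairs = combinations(range(len(points)), 2)
    return min(pairs, key=lambda ij: criterion([points[ij[0]]], [points[ij[1]]]))


points = [[0.0, 0.0], [3.0, 4.0], [0.5, 0.0], [10.0, 10.0]]
merges = {f.__name__: first_merge(points, f) for f in (single, complete, average, ward)}
```

For these points all four criteria select accessions 0 and 2 as the first subgroup; the methods only start to differ once clusters contain more than one accession.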

In general methods of constructing core collections by clustering, the cluster method is one of the important factors affecting the resulting core collection. With the LDSS strategy, however, once the genetic distance and the sampling percentage are fixed, the cluster method need not be considered because of the properties of the strategy. Serving plant breeding is an important aim of constructing core collections. A well-representative core collection is an extremely useful resource for breeders, because it can save much expense and time in the course of plant breeding. The present results suggest that constructing core collections by the LDSS strategy with the Seuclid distance is an effective way to obtain well-representative core collections.