Background

Human populations display wide variation in their susceptibility to and manifestations of infectious, metabolic, or psychiatric disease. This variation has been shown by biometrical analyses to have a genetic component of greater or lesser magnitude, depending on the disease. Generally, susceptibility traits belong to the class of “complex” (also termed “quantitative”) traits. That is, their specific manifestation in an individual is controlled by the cumulative effect of many genetic factors, interacting with one another and with the life history of the individual. Because of this complex etiology, it has proven exceedingly difficult in humans to identify the individual genetic elements (“QTG,” taken in the broadest sense to include functional RNA transcripts, up- and down-stream regulatory sites, enhancers, and so on) or sequence variants within the QTG (“QTN,” taken in the broadest sense to include all sequence variants: SNPs, indels, CNVs, and larger chromosomal re-arrangements) contributing to genetic variation in susceptibility to specific diseases. This has prompted the development of mouse genetic resources for genetic analysis of complex traits. Humans and mice share widely in their genome architecture, and in their basic biology and disease susceptibility. Large numbers of inbred mouse lines that vary in their genetic architecture have been developed and are commercially available. These have formed the basis for genetic reference populations (GRPs) based on collections of inbred lines or crosses between them (Roberts et al. 2007; Silver 1995). 
GRPs are popular for the study of complex traits and biological systems in both medical and life sciences for several reasons: genotyping is required only once (the “genotype once, phenotype many times” paradigm); replicate individuals with the same genotype can be produced in larger cohorts, allowing for optimal case/control and gene-by-environment designs (Broman 2005); and data from numerous experiments can be accumulated for the population, allowing deep bioinformatic data mining. GRPs can be studied under defined environmental conditions, providing model systems through which the genetic elements responsible for the genetic variation of complex traits, including disease susceptibility, can be mapped to defined chromosomal regions [termed “quantitative trait loci” (QTL)] (see e.g., Groden et al. 1991; Hernandez-Valladares et al. 2004a, b; Houle 1992; Iraqi et al. 2000, 2003). Yet, with some exceptions, even in such model systems, it has not been possible to identify the actual functional QTG and QTN, in part because of the complex genetic architecture of these lines, which precludes high-resolution mapping. This has led the mouse genetics community to propose and design a new “next generation” GRP, the Collaborative Cross (CC) (Threadgill et al. 2002). The CC is a mouse GRP specifically tailored for high-resolution QTL mapping of complex traits, with special emphasis on traits relevant to human health in its broadest aspects. However, the CC goes beyond this to include expanded possibilities for analysis of epistatic interactions among QTG, identification of the biological systems and networks within which the QTG are embedded and through which their QTN exert their effect at the whole-organism phenotypic level, and the interaction of these biological systems with the environment as mediated through epigenetic markings (Churchill et al. 2004; Silver 1995).

The CC resource

This unique genetic reference population will eventually comprise a set of approximately 500 recombinant inbred lines (RIL) created from almost full reciprocal matings of eight divergent strains of mice (some of the F1 crosses were not viable). These include five classical inbred lines (A/J, C57BL/6J, 129S1/SvImJ, NOD/LtJ, and NZO/HiLtJ) and three wild-derived strains: CAST/Ei, derived from M. m. castaneus mice trapped in Thailand in 1971; PWK/PhJ, derived from wild M. m. musculus mice trapped near Prague in 1974; and WSB/EiJ, derived from wild M. m. domesticus mice trapped on the Eastern Shore of Maryland in 1976 (Beck et al. 2000).

Currently, over 350 pre-CC lines are in more or less advanced stages of development at three locations: Tel Aviv University, Israel (TAU) (Iraqi et al. 2008); University of North Carolina, USA (UNC) (Chesler et al. 2008); and Geniad Ltd, Western Australia (GND) (Morahan et al. 2008). In addition to TAU, UNC, and GND, The Jackson Laboratory and Oxford University also participated in the initial development of the CC resource lines. To facilitate community access to the CC, a material transfer agreement was executed among all five parties and can be obtained from any of them (Welsh et al. 2012). Genotypes are available at a dedicated Web site: http://csbio.unc.edu/CCstatus with a browser to facilitate visualization and interaction with the genomes of the individual CC lines: http://csbio.unc.edu/CCstatus/?run==CCV/.

Controlled randomization was performed during the breeding process to break up large linkage disequilibrium (LD) blocks and to recombine the natural genetic variation present in these inbred strains, with the aim of creating a unique resource of RI strains exhibiting large phenotypic and genetic diversity (Roberts et al. 2007).

The CC population exhibits about fourfold map expansion compared with a single-generation cross, increasing the accuracy of QTL map location in proportion. Because the lines are inbred, QTL contrasts involve only homozygotes, which increases the genetic variation associated with each QTL (Falconer and Mackay 1996); there may, of course, be exceptions to this rule. In addition, multiple individuals can be phenotyped in each line, reducing environmental sources of variation. In this way, the effective mapping power of the set of RILs is increased many-fold relative to standard F2 mapping populations (Valdar et al. 2006). Initially, all CC mice were genotyped with the mouse diversity array (MDA), which contains 620,000 SNP markers (Yang et al. 2009), and their genome reconstruction was presented (Durrant et al. 2011). Recently, all mice were regenotyped at advanced generations with the new custom-designed 7,500-SNP mouse universal genotype array (MUGA), which provided the genome architecture of the CC lines (CCC 2012). After six inbreeding generations, the entire CC mouse population was regenotyped with the 77,000-SNP MegaMUGA array, and all genotypes will be made available. Figure 1 shows the genomic reconstruction of three CC lines after genotyping with MDA (Durrant et al. 2011), using the HAPPY software (Mott et al. 2000).

Fig. 1

Reconstructions of the genomes of representative CC lines. Genomes of CC lines IL-127 (upper panel), IL-134 (middle panel), and IL-135 (lower panel) were reconstructed using a hidden Markov model (HMM) implemented in the HAPPY program (Mott et al. 2000). The X-axis shows the 19 autosomes. The Y-axis shows the eight CC founders, with the probability of descent from each founder. Regions attributed with high probability to a single founder appear as dark horizontal bands in the lane corresponding to that founder. Regions where two or more putative founders cannot be distinguished are gray. Regions where a founder is not represented at all are white

With some exceptions, the founder line haplotypes are distributed more or less equally across the population of lines as a whole, although the distribution of founder genomes within individual chromosomes or lines can differ widely from equality. Linkage disequilibrium decayed rapidly with distance, as expected for a collection of independent inbred lines, and there were no indications of gametic disequilibrium (i.e., of LD among unlinked markers), which agrees with the report by Broman et al. (2012). Consequently, except for type I error due to sampling, marker–QTG association tests will be significant only between markers and QTN that are closely linked. This is in contrast to the situation found in panels of classical mouse GRP involving collections of inbred strains, where extensive long-range gametic disequilibrium is present because of historical relationships among the lines and results in a much higher effective type I error than predicted by sampling considerations alone. Thus, the CC resource is well on the way to fulfilling the most sanguine hopes of its community.

A recent study characterizing the genome architecture of 350 pre-CC lines (CCC 2012) showed that the wild-derived mice indeed contributed enormous stores of genetic variation that were not found in the standard mouse laboratory strains. Whereas most classical strains differ from the reference C57BL/6J at about 4 million SNPs, PWK/PhJ and CAST/Ei each differ at about 17 million SNPs, and WSB/EiJ at 6 million (Keane et al. 2011). Among them, the eight founder strains present 36.155 million SNPs. This is many-fold greater than the genetic diversity captured in existing GRPs derived from crosses or panels of standard laboratory strains. Consequently, QTL mapping using the CC is likely to uncover novel QTLs involving contrasts between the wild-derived strains. This is exemplified in a pilot experiment in our laboratory in which we fine-mapped eight QTLs associated with post-challenge survival after infection by Aspergillus fumigatus. Of these, five QTL involved contrasts with wild-derived strains and would not have been present in a cross between classical strains (Durrant et al. 2011). That study, and others by our collaborators (Aylor et al. 2011; Bottomley et al. 2012; Kelada et al. 2012; Kovacs et al. 2011; Mathes et al. 2011; Philip et al. 2011), further showed that the confidence interval of QTL location can be significantly narrowed, and the list of potential candidate genes markedly refined, by incorporating variation data from the genome sequences of the CC founders (available from the Sanger Mouse Genomes Project; Keane et al. 2011) and restricting attention to variants whose differences across the founders are consistent with the pattern of action of the QTL (Yalcin et al. 2005).

In the present report, we present data, primarily from our own laboratory, documenting extensive genetic variation among CC lines. This variation is expressed by the heritability statistic, which represents the proportion of total phenotypic variation that can be attributed to genetic factors, and by the coefficient of genetic variation (CVG), which provides a unit-free measure of the relative magnitude of the genetic variation. In addition, we discuss previous studies from our laboratory and others demonstrating the ability of the CC resource to provide high-resolution mapping of QTL affecting a wide variety of traits, including susceptibility to a spectrum of infectious diseases. Finally, we consider the potential role of the CC as a uniquely powerful resource for systems biology.

Materials and methods

Heritability and genetic coefficient of variation analysis in the CC lines

Heritability estimates were obtained from unpublished data for a wide variety of traits presently being studied at TAU and from the published results of Philip et al. (2011) for selected morphological and behavioral traits recorded for the lines under development at Oak Ridge National Laboratory [(ORNL), later transferred to North Carolina State University]. Traits at ORNL were monitored from the initial crosses of the funnel breeding scheme (G1 generation), through the seventh inbreeding generation (G2:7 generation).

Brief descriptions of the study traits are given in Box 1. For TAU data, broad-sense heritability estimates (H2) including epistatic but not dominance effects were obtained from a one-way ANOVA with CC lines as the main effect as follows:

$$ H^{2} = V_{\text{g}} /\left( {V_{\text{g}} + V_{\text{e}} } \right) $$

where

Ve: the environmental component of variance within lines = MSwithin

Vg: the genetic component of variance among CC lines = (MSbetween − Ve)/n

n: the average number of mice per line

Box 1 Data sets from TAU and ORNL for calculation of heritability and coefficient of genetic variation

For the ORNL data, narrow-sense heritability (h2) estimates (not including epistatic effects) were obtained by Philip et al. (2011) from parent–offspring regressions.
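The one-way ANOVA estimator of H2 described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used for the TAU data: the function name, the input layout (a mapping from line name to the phenotype values of its mice), and the truncation of a negative Vg estimate at zero are all choices made here for the example.

```python
from statistics import mean


def broad_sense_heritability(lines):
    """Estimate H2 = Vg / (Vg + Ve) from a one-way ANOVA with CC line as the factor.

    `lines` maps a (hypothetical) line name to the phenotype values of its mice.
    Returns (H2, Vg, Ve).
    """
    groups = list(lines.values())
    k = len(groups)                               # number of CC lines
    total_n = sum(len(g) for g in groups)         # total number of mice
    grand_mean = sum(sum(g) for g in groups) / total_n
    group_means = [mean(g) for g in groups]

    # Sums of squares between and within lines
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (total_n - k)

    n = total_n / k                               # average number of mice per line
    v_e = ms_within                               # Ve = MSwithin
    # Vg = (MSbetween - Ve) / n; truncating at zero is an assumption of this sketch
    v_g = max((ms_between - v_e) / n, 0.0)
    return v_g / (v_g + v_e), v_g, v_e


# Toy values, invented for illustration only
toy = {"line_A": [1.0, 2.0, 3.0], "line_B": [4.0, 5.0, 6.0], "line_C": [7.0, 8.0, 9.0]}
h2, v_g, v_e = broad_sense_heritability(toy)
print(round(h2, 3), round(v_g, 3), round(v_e, 3))
```

With unequal numbers of mice per line, using the simple average for n is an approximation; the sketch follows the definition of n given in the text.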

The genetic coefficient of variation

The heritability statistic estimates the proportion of observed phenotypic variation that is due to genetic factors. However, it does not tell us whether the absolute amount of genetic variation generated by these genetic factors (the “genetic component of variation”) is great or small. A high heritability value is compatible with very little absolute genetic variation (if total phenotypic variation is also very low and mostly due to genetic factors), while a low heritability value is compatible with a large genetic component of variation (if phenotypic variation is very large). The absolute value of the genetic variation is readily obtained as the genetic standard deviation (Vg^0.5, the square root of the genetic variance). This value, however, depends on the measurement unit of the trait and is not meaningful for comparison among traits. The coefficient of variation (ratio of standard deviation to the mean) is a commonly accepted unit-free measure of dispersion. We used the well-known evolvability parameter, the ratio of the genetic standard deviation to the mean (also termed the genetic coefficient of variation, CVG), as the comparable measure for unit-free evaluation of genetic dispersion (Garcia-Gonzalez et al. 2012; Houle 1992).

It is of interest to give a benchmark for judging whether CVG values are large or small. As a rough estimate, within any given outcrossing population, quantitative traits generally have a phenotypic standard deviation equal to about 10 % of the mean, i.e., SD/mean = 0.1. If we consider a trait with heritability 0.50, the genetic variance will be 0.5 of the phenotypic variance, and the genetic standard deviation will be the square root of this, or 0.71 SD. Thus, the CVG will be about 0.071. Comparing CVG among the CC lines with this benchmark therefore shows how the genetic variation among the CC lines compares with that found for many quantitative traits within a typical outcrossing population.
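The benchmark arithmetic can be written out explicitly. Taking the round values assumed above (heritability 0.50 and SD/Mean = 0.10), and using the fact that the genetic variance is the heritability times the phenotypic variance:

$$ {\text{CV}}_{\text{G}} = \frac{{\sqrt {V_{\text{g}} } }}{\text{Mean}} = \sqrt {H^{2} } \times \frac{\text{SD}}{\text{Mean}} = \sqrt {0.50} \times 0.10 \approx 0.071 $$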

For TAU data, CVG was estimated as

$$ {\text{SD}}_{\text{G}} /{\text{Mean}} $$

where

SDG: the broad-sense genetic standard deviation among CC lines = Vg^0.5

Mean: the mean trait value across all CC lines

For the ORNL data, which provided information on narrow-sense heritability, the additive genetic standard deviation (SDA) was calculated from the values for mean, h2, and SD presented in Table 2 of Philip et al. (2011).
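Both coefficients can be computed as in the following sketch. The function names are hypothetical, and the ORNL calculation assumes the standard relation SDA = sqrt(h2) × SD (that is, additive variance equals h2 times phenotypic variance); the text names the inputs used but does not spell out the formula, so that step is an assumption of this example.

```python
import math


def cv_genetic(v_g, trait_mean):
    """CVG = SDG / Mean, where SDG is the square root of the among-line genetic variance."""
    return math.sqrt(v_g) / trait_mean


def cv_additive(h2, sd_phenotypic, trait_mean):
    """CVA = SDA / Mean, assuming SDA = sqrt(h2) * SD (an assumption of this sketch)."""
    sd_a = math.sqrt(h2) * sd_phenotypic
    return sd_a / trait_mean


# Illustrative numbers only, not values from Table 1
print(cv_genetic(v_g=4.0, trait_mean=10.0))                       # SDG = 2, so CVG = 0.2
print(round(cv_additive(h2=0.5, sd_phenotypic=1.0, trait_mean=10.0), 3))
```

The second call reproduces the outcrossing-population benchmark of about 0.071 discussed earlier, since it uses a heritability of 0.5 and SD/Mean of 0.1.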

Results and discussion

Here, we present selected phenotypic profiles of traits that were characterized in the TAU CC lines. Figure 2 shows the variation in total polyp count among different (Apc Min− x CC) F1 mouse lines. Figure 3 shows the variation in blood glucose level during intraperitoneal glucose tolerance test (IPGTT) assessments of different CC lines. Figures 4 and 5 show the distributions of body length and body mass index (BMI), respectively, of 20-week-old mice of different CC lines. Figure 6 shows the percentages of T, B, and macrophage immune cells in different naïve CC mouse lines.

Fig. 2

Variation among CC lines: modifiers of familial adenomatous polyposis (Apc) gene. Mean polyp count in F1 cross of CC line x Apc Min−. The X-axis represents the 15 tested CC lines (2–4 mice per line), while the Y-axis represents the mean polyp count (with standard error) in the small intestine and colon at age 5 months. The mean polyp count in control C57BL/6J (APCmin+/−) mice is also presented

Fig. 3

Variation among CC lines: intraperitoneal glucose tolerance test (IPGTT). Blood glucose levels during an IPGTT, performed on 11 CC mouse lines (4–6 mice per line), maintained on high-fat diet for a period of 12 weeks. The X-axis represents the time points (min) used for testing blood glucose levels (mg/dL). Blood glucose levels (with standard error) are represented on the Y-axis

Fig. 4

Body length of 20-week-old mice of CC lines. Body length was defined as the distance between nose and anus. The X-axis represents the 15 tested CC lines, while the Y-axis represents the mean body length (cm) with standard error. On average, 3–6 mice per line were assessed. Mean body lengths of the tested CC lines ranged between 8.2 and 10.3 cm

Fig. 5

Body mass index (BMI) of CC lines. BMI was defined as body weight divided by the square of body length (BW/BL²), at 20 weeks of age. The X-axis represents the 15 tested CC lines (3–6 mice per line), while the Y-axis represents the BMI (with standard error)

Fig. 6

Variation among CC lines: percentage of T, B, and macrophage immune cells. The X-axis represents 15 naïve CC mouse lines (5–7 mice per line), while the Y-axis represents the percentage (with standard errors) of the three tested immune cells in peripheral blood at age of 8–9 weeks

Table 1 presents heritability and genetic coefficient of variation values for a variety of traits studied at TAU and obtained from ORNL data. For the TAU traits, H2 values are generally in the range of 0.10–0.40; for ORNL, h2 values were higher, in the range of 0.37–0.89. In both data sets, CVG or CVA was in the range 0.15–0.60, much higher than the benchmark of 0.071. Thus, these data show that the absolute magnitude of genetic variation among the CC lines is much greater than that found within typical outcrossing populations. This can be attributed to the inclusion of the three wild-derived strains, representing three subspecies, among the founder parents. It also suggests that QTL with very large effects may be segregating in the CC lines. This is supported by the initial QTL mapping studies that were conducted using CC lines in various stages of development (“pre-CC lines”). The experiments are listed with brief descriptions in Box 2. These QTL mapping studies (Table 2) were remarkably successful in mapping QTL for a variety of traits to very narrow confidence intervals, using population sizes much smaller than would normally be required to achieve such results. In a few cases, resolution was sufficiently precise to point to one or a very small number of candidate genes for the QTN.

Table 1 Collaborative Cross, genetic variation among lines
Box 2 QTL mapping experiments
Table 2 Collaborative Cross, QTL mapping

This unexpected mapping effectiveness, relative to the simulations of Valdar et al. (2006), can be attributed to a number of factors:

(1) As noted, the CC includes three wild-derived strains, representing three Mus musculus subspecies. Apparently, these introduced genetic variants with much stronger effects than those uncovered in QTL analyses of the standard mouse inbred lines, or in domesticated livestock and poultry.

(2) QTL mapping in the CC resource is based on an eight-allele founder haplotype model, instead of a simple marker-based association test. This avoids the confounding that arises when the linkage phase between marker allele and QTL allele differs among founder strains.

(3) Sorting parental lines according to the allele effect of the QTL identifies regions of similar effect and similar haplotype shared by two or more founders, in contrast to the effects and haplotypes of the other founders. This limits the location of the QTL to the chromosomal region and marker haplotype that is common to the founders sharing the QTL allele. The reduction in confidence interval of QTL location by this means can be dramatic.
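Points (2) and (3) can be illustrated with a small sketch of a single-locus test under an eight-founder haplotype model, followed by a ranking of founders by estimated effect. This is a simplified stand-in and not the HAPPY implementation or the procedure used in the cited studies: the function name is hypothetical, the data are simulated, each line is assumed to carry exactly one founder haplotype at the locus (a one-hot row in place of HMM descent probabilities), and a plain least-squares F test is used.

```python
import numpy as np


def founder_haplotype_f_test(phenotype, founder_probs):
    """F statistic comparing a founder haplotype model with a mean-only model at one locus.

    phenotype     : (n_lines,) array of line means
    founder_probs : (n_lines, n_founders) array of descent probabilities; rows sum to 1
    Returns (F statistic, estimated founder effects).
    """
    y = np.asarray(phenotype, dtype=float)
    P = np.asarray(founder_probs, dtype=float)
    n = P.shape[0]

    # Rows of P sum to 1, so P already spans the intercept; no extra column is added.
    beta, *_ = np.linalg.lstsq(P, y, rcond=None)
    rss_full = float(np.sum((y - P @ beta) ** 2))
    rss_null = float(np.sum((y - y.mean()) ** 2))

    df_model = int(np.linalg.matrix_rank(P)) - 1
    df_resid = n - df_model - 1
    f_stat = ((rss_null - rss_full) / df_model) / (rss_full / df_resid)
    return f_stat, beta


# Simulated toy data (not from the study): 80 hypothetical lines, ten per founder.
rng = np.random.default_rng(0)
founder_of_line = np.arange(80) % 8
probs = np.eye(8)[founder_of_line]            # one-hot stand-in for HMM probabilities
true_effects = np.array([0, 0, 0, 0, 0, 2, 2, 0], dtype=float)  # founders 5 and 6 share an allele
line_means = true_effects[founder_of_line] + rng.normal(0.0, 0.3, size=80)

f_stat, effects = founder_haplotype_f_test(line_means, probs)
ranked_founders = np.argsort(effects)         # founders ordered from lowest to highest effect
print(f_stat, ranked_founders)
```

In this toy example, the two founders simulated to share the high-effect allele sort together at the top of the ranking. In a real analysis, the next step would be to look for the chromosomal segment where those founders share a haplotype that the others lack.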

These studies confirm that by phenotyping a relatively modest number of CC lines (around 100 lines), with sufficient replication, it is possible to map QTLs to a resolution of about 1 Mb (Aylor et al. 2011; Durrant et al. 2011; Philip et al. 2011; Bottomley et al. 2012; Kelada et al. 2012), subsequently leading to identification of strong candidate genes. Knockouts of candidates within QTL can then be used to confirm function. Deep RNA sequencing (Ansorge 2009; Ng et al. 2010; Voelkerding et al. 2009) can also be employed to identify the pathways activated by these genes. We believe that these achievements cannot be obtained with any other currently available mouse resource populations.