1 Introduction

1.1 What Is Positional Cloning?

Positional cloning is a technique that identifies a trait-associated gene based on its location in the genome and involves methods such as linkage analysis, association mapping, and bioinformatics. In reality most traits are regulated by multiple loci called quantitative trait loci (QTL). The objective of QTL mapping is to identify genomic regions associated with a trait. In the absence of a simple relationship between a QTL and a trait, QTLs are mapped by linked genetic markers that segregate in a Mendelian fashion and can be unambiguously determined. If a QTL is adjacent to a genetic marker, the phenotypic values for the trait will differ between genotypes at the specific marker. The difference in phenotypic value between genotype groups will be larger the closer together the QTL and marker are, and it will reach a maximum when the marker exactly coincides with the gene. Given a genetically segregating population that originates from strains that are variable for the phenotype, the whole genome can be systematically tested for the presence of a QTL through evenly spread informative markers (i.e., polymorphic between the strains used to create the mapping population). The segregation of markers with the QTL and association with the trait are then statistically modeled to provide evidence for the presence of a QTL. This approach can be used even when little is known about the molecular basis of the trait.
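The relationship between marker-QTL distance and the observed difference between genotype groups can be illustrated with a short calculation. The sketch below assumes a backcross, in which recombinants between marker and QTL dilute the true effect by a factor of (1 − 2r), and it assumes the Haldane map function (no interference) to convert map distance to the recombination fraction r. The function names and the unit effect size are illustrative and are not taken from any particular software.

```python
import math


def haldane_recombination_fraction(distance_cm: float) -> float:
    """Recombination fraction r for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def expected_marker_difference(qtl_effect: float, distance_cm: float) -> float:
    """Expected phenotype difference between marker genotype groups in a backcross.

    qtl_effect is the true difference between the two QTL genotypes.
    Recombination between marker and QTL shrinks it by the factor (1 - 2r).
    """
    r = haldane_recombination_fraction(distance_cm)
    return (1.0 - 2.0 * r) * qtl_effect


# Assumed example: a QTL with a true effect of 1 phenotype unit.
for d in (0, 5, 10, 20, 50):
    diff = expected_marker_difference(1.0, d)
    print(f"{d:>3} cM from the QTL: expected difference = {diff:.3f}")
```

The expected difference is largest when the marker coincides with the QTL (distance 0) and decays toward zero as the marker moves away, which is why evenly spaced informative markers are sufficient to scan the genome.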

1.2 Experimental Populations

Localization of QTLs is possible in a mapping population, which allows association between a genetic marker and the trait. In practice it most often starts with a selection of suitable inbred strains. The difference in the phenotype between inbred strains (each genetically identical and homozygous) kept in the same environment is solely a reflection of differences in variants of their genes. The main factor determining the precision of the QTL position is the recombination rate. Recombination events occur during the pachytene stage of meiosis through the introduction of breaks and exchange of material between chromatids. A variety of different breeding strategies leads to the remixing of parental genomes and creation of new genotype combinations. The use of crosses between inbred strains provides high power to detect linkage due to one round of recombination and all individuals are informative. Intercross (F2) and backcross (BC) populations are most frequently used in QTL mapping. An F2 cross generated by interbreeding F1 hybrids (heterozygous throughout the genome) is more informative and allows identification of both dominant and recessive QTLs in the same cross compared to a BC. The latter, generated by interbreeding F1 hybrids with one of the parental inbred strains, is more suitable than an F2 for mapping QTLs of major effect (1) and parent-of-origin effects (2). A few recombination events in both F2 and BC offspring allow a whole genome scan with a limited number of genetic markers. To achieve higher resolution an efficient approach is to increase the actual number of recombination events without a significant increase in population size. This can be achieved by creating an advanced intercross line (AIL). An AIL is created by randomly intercrossing two inbred strains for several generations, avoiding sister-brother mating (3). This scheme generates genetically unique offspring with a dense mixture of founder chromosomal fragments. 
Intercrossing causes accumulation of recombination events leading to more precise QTL mapping. The most important requirement is a minimum of 100 individuals in the breeding population in each generation to reduce allele fixation by genetic drift. Even with this size the exponential decrease in confidence interval will continue until approximately the tenth generation, after which it will display a rather stable course. An additional requirement is a proportionally higher marker density necessary to achieve an equivalent power to detect QTL as in an F2 cross. To achieve a higher magnitude of resolution an efficient approach is to increase both the number of founding genomes and the number of recombination events, although this requires an increase in population size. A heterogeneous stock (HS) is a population that harbors recombinants derived from inbred strains that have accumulated over many generations of outbreeding to create a genetic mosaic. An HS is established from eight inbred strains (4, 5). By using a standard pseudo-random outbreeding schedule, inbreeding is minimized and recombination density is maximized to reduce the size of inherited haplotypes (6). This approach also reduces allele fixation by genetic drift, which enables breeding the population beyond 50 generations to accumulate recombination events and mix the founding genomes. Therefore, an HS can provide mapping resolution that is exponentially higher than an AIL, allowing fine mapping of QTLs to intervals smaller than a cM (7). Theoretically, it is possible to perform a genome-wide association study of complex traits in the HS, to identify and fine-map QTLs in the same population. Thus, this approach combines the gene identification step usually performed in BC/F2 populations with the fine-mapping step usually done in interval-specific recombinant congenic strains and advanced populations (8). 
The size of the mapping population required for QTL detection depends on the phenotype variance in the population and the number of QTLs and their effects, which is difficult to predict. In general, a larger population gives a greater chance of detecting several QTLs, even those displaying small effects. Overall, as genetic complexity increases to match complex phenotypes, experimental control and statistical simplicity decrease. The populations that offer the most control and statistical confidence are also the most genetically homogeneous populations (inbred and congenic strains), and therefore the most artificial models. On the other hand, populations that offer the highest probability of capturing the complexity involved in multifactorial traits are heterogeneous populations (HS) and are therefore more sensitive to the influence of family structure and confounders. To cope with this, rigorous statistical analyses using stringent criteria are necessary. This may partly explain why second generation crosses, which are in the middle of the complex genetics-statistical control continuum, are so widely used in genetic research.

1.3 Genetic Markers

A genetic marker is any polymorphic Mendelian character that could be used to follow a specific genomic location (i.e., to establish the strain of origin of inherited allele). The development of DNA markers in the early 1980s facilitated the era of whole-genome screening. PCR amplification of microsatellite markers revolutionized genomic screening. Microsatellites are simple di-, tri-, or tetranucleotide repeats. They are highly heterozygous, easy to score by standardized PCR, and relatively frequent (every 10–20 kb). The newest generation of markers, i.e., single nucleotide polymorphisms (SNP), despite their bi-allelic form are introduced due to the ultrahigh-throughput genotyping capabilities and high density (approximately >3 million SNPs between any two given strains). The development of dense genetic maps in different organisms followed the discovery of new markers and suitable detection methods. The spacing of markers required to achieve power enough to detect QTLs depends on the QTL effect and the type and size of the mapping population. Marker spacing of 10–25 cM is often accepted in F2 and BC populations. To achieve the same power in an AIL as in an F2 or BC, marker spacing of 1–5 cM is required. Microsatellites are still the genetic markers of choice in the rat. They are numerous and can provide <1 cM coverage in certain strain combinations. Detection is based on well-standardized PCR and linkage and physical maps are available. Despite more than 30 rat strains being sequenced and rapid SNP discovery, development of inexpensive SNP genotyping methods in the rat is still in its infancy.

1.4 QTL Identification

Once the phenotype and genotype of each individual in a mapping population have been determined, statistical tools to identify the existence, location, and significance of QTLs are applied. There is a distinction between methods testing single markers and intervals as well as methods assuming the existence of one or multiple QTLs. These different methods have advantages and disadvantages to consider when selecting the appropriate model to be used, which will be briefly discussed. Model selection may never be perfect, but choosing an incorrect model can be detrimental. The simplest method performs analysis at a single marker (9). It compares phenotypic expression between groups of experimental animals stratified according to their genotypes at a given marker using a simple t-test, analysis of variance, or any nonparametric test. If the genotype groups differ in phenotype there is an indication that the marker is linked to a QTL. If they are the same the marker is not linked to a QTL. The main advantage of this method is simplicity and no need for a genetic map (i.e., the order of genetic markers and distance between them) and special software. A particular strength of the method is that it can be easily extended to account for multiple covariates and QTLs. The main disadvantage is that the QTL location is imprecise because it depends on the marker location and estimated effects, assessed on marker genotypes, are always smaller than in reality. Furthermore, the method suffers from decreased power since all individuals with missing genotypes have to be excluded from the analysis. This has a most prominent influence when genotyping is performed on a coarse scale. We still use single-marker tests to confirm our findings and rule out the possibility of false positives due to coarse genotyping, incorrect marker order, or violation of trait normality. Interval mapping (IM), developed by Lander and Botstein (10), is the most popular method for QTL analysis. 
Testing for a putative QTL is performed by “walking” along the genetic map. It is possible to calculate the probability of an individual’s genotype at a putative location, depending on the genotypes of the nearest flanking markers. Thus additional information is gained from the relationship between markers. The strength of evidence is measured by LOD scores (“logarithm of odds” favoring linkage). A LOD score measures the strength of evidence for the likelihood of QTL presence given the data compared to the likelihood of no QTL at the position. Maximum likelihood (ML) is obtained where parameters are estimated to give the highest probability for the observed data (maximal value). ML uses a reiterative process of associating a trait to a genomic location based on probability and then reevaluating linkage with newly created information until a QTL is detected. IM has several advantages: (1) QTL can be more precisely localized, (2) estimation of the QTL effect is greatly improved, and (3) missing genotypes and errors are accounted for. Statistical stringency is used to correct for the multiple tests performed in order to decrease the risk of false positive QTLs. The relative disadvantage is heavy computations that require specialized software. Haley and Knott have developed a multiple regression method that remarkably well approximates IM (11). Moreover, it could easily be extended to multiple QTLs and covariate analyses while being much less computationally intense. Both single-marker and IM analyses assume a model with a single QTL. Mapping multiple QTLs simultaneously has several important advantages: (1) increase in power, (2) separation of linked QTLs, and (3) mapping epistatic interactions. Epistatic interactions are largely unpredictable and multiple QTL models might help in mapping regions that would never come up in single-QTL analysis. 
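As an illustration of how a LOD score is obtained, the following minimal sketch computes the LOD at a single fully genotyped marker by regression, comparing a model with one mean per genotype group against a model with a single common mean. Under a normal model this gives LOD = (n/2) log10(RSS0/RSS1). It is not the R/qtl implementation; the function name, the genotype codes, and the phenotype values are made up for the example, and positions between markers (where genotype probabilities would be needed) are not handled.

```python
import math


def marker_lod(phenotypes, genotypes):
    """LOD score at one marker: genotype-group means versus one common mean.

    Individuals with a missing genotype (None) are dropped, which is the
    loss of power noted for single-marker analysis.
    """
    pairs = [(y, g) for y, g in zip(phenotypes, genotypes) if g is not None]
    n = len(pairs)
    grand_mean = sum(y for y, _ in pairs) / n
    rss_null = sum((y - grand_mean) ** 2 for y, _ in pairs)

    # Residual sum of squares when each genotype group has its own mean.
    groups = {}
    for y, g in pairs:
        groups.setdefault(g, []).append(y)
    rss_qtl = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        rss_qtl += sum((y - group_mean) ** 2 for y in values)

    if rss_qtl == 0.0:  # perfect fit; avoid division by zero
        return float("inf") if rss_null > 0.0 else 0.0
    return (n / 2.0) * math.log10(rss_null / rss_qtl)


# Assumed toy backcross data: six animals, genotypes BB or BW at one marker.
phenotypes = [10.1, 9.8, 10.4, 12.0, 11.7, 12.3]
linked = ["BB", "BB", "BB", "BW", "BW", "BW"]
unlinked = ["BB", "BW", "BB", "BW", "BB", "BW"]
print("LOD, marker tracking the phenotype:", round(marker_lod(phenotypes, linked), 2))
print("LOD, marker unrelated to phenotype:", round(marker_lod(phenotypes, unlinked), 2))
```

A marker whose genotype groups differ clearly in phenotype yields a high LOD, whereas a marker unrelated to the trait yields a LOD near zero.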
The p-value reflects the probability of obtaining an equal or larger LOD score than that observed in the case where there are no QTLs in the population. Very small p-values provide strong evidence that the QTL exists. Corrections must be made for testing a hypothesis at a number of locations. Lander and Kruglyak performed extensive simulations and derived rather stringent but robust criteria for suggestive and significant linkage of 2.8 and 4.3 LOD scores, respectively, in an F2 intercross with 2 degrees of freedom (12). Another very attractive approach, introduced by Churchill and Doerge, is the permutation test (13). The use of permutation provides significance thresholds specific to the performed experiment. It is based on randomly reassigning phenotypes to genotypes, recording the maximum LOD score obtained in each permuted data set, and estimating how often a certain LOD score occurs in the permuted data. The conventional permutation method cannot be used to set the significance thresholds in the AIL, because its family structure differs from that of the F2 generation for which the method was developed. To set significance thresholds in an AIL, the within-family variance (inheritance of phenotype with the causing genotype, i.e., linkage) is removed to determine LOD scores for between-family variance (representing random effects, i.e., no linkage) (14). Further confirmation of true QTL existence comes from reproducibility in independent experiments. Among the many available software packages, we will focus on R/qtl (15). R/qtl offers a number of options for QTL mapping and allows both single- and two-QTL mapping using ML, Haley-Knott regression, and multiple imputation methods. Furthermore, R/qtl provides other models such as binary, nonparametric, and two-part models. The nonparametric model is based on Kruskal–Wallis statistics and is well suited for phenotypes not fulfilling normality criteria. 
In the case of a spike in the phenotype distribution, the two-part model, which is a combination of binary and nonparametric models, provides maximum extraction of information. Moreover, models assuming the existence of more QTLs and/or interactions between them can be tested with multiple imputation. An additional feature is the permutation test for setting up the threshold for significant linkage. At the moment, R/qtl provides more methods and models than any other software. The association studies in HS involve a more specialized statistical approach to correct for the population structure created by the different degrees of genetic relatedness between individuals (16). The multitude of methods developed for genome-wide analyses in classical intercrosses is not applicable to this population (17–19). Therefore, novel analytical methods and statistical packages were developed specifically for this population to detect haplotype association (7, 8), namely Resample Model Averaging and Mixed Models. For normally distributed phenotypes, the preferred method is the Mixed Models approach (20), in which the population structure and genetic relatedness are corrected for to reduce false positives in association mapping. Pairwise genetic relatedness is incorporated into the statistical model to account for the fact that two genetically similar individuals are more likely to have correlated phenotypes than two genetically dissimilar individuals. The haplotype reconstruction phase of analysis is carried out using the R/HAPPY software. The QTLs are then fitted using R/EMMAX. The false discovery rates of identified QTLs are calculated to determine significance. For non-normally distributed phenotypes, the Resample Model Averaging method is used. Parental haplotypes are reconstructed, using a hidden Markov model approach, to predict probabilities of inheritance from each of the eight progenitor strains for each SNP. 
A multiple QTL model is then fitted using a model averaging method to obtain a posterior probability that a QTL will be included in the model (7). This is accomplished by repeatedly resampling the data and in each resample test which set of markers best explains the variation in the phenotype. Hence, the association between phenotype and genotype at any one locus is corrected by the pattern of associations over the rest of the genome. The haplotype reconstruction phase of analysis is carried out using the R/HAPPY software. The QTLs are then fitted using R/Bagphenotype.
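The permutation test described in this section for F2 and BC populations can be sketched as follows. This is a simplified illustration under several assumptions, not the R/qtl procedure: LOD scores are computed only at markers by regression, the markers are simulated as unlinked, there are no missing genotypes, and all names and simulated values are hypothetical. As noted above, this conventional permutation scheme is not appropriate for an AIL or HS because of their family structure.

```python
import math
import random


def marker_lod(phenotypes, genotypes):
    """LOD at one marker: genotype-group means versus a single common mean."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    rss_qtl = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        rss_qtl += sum((y - group_mean) ** 2 for y in values)
    if rss_qtl == 0.0:  # perfect fit; avoid division by zero
        return float("inf") if rss_null > 0.0 else 0.0
    return (n / 2.0) * math.log10(rss_null / rss_qtl)


def max_lod(phenotypes, genotype_table):
    """Largest LOD over all markers (the genome-wide maximum)."""
    return max(marker_lod(phenotypes, g) for g in genotype_table.values())


def permutation_threshold(phenotypes, genotype_table, n_perm=1000, alpha=0.05, seed=0):
    """Genome-wide LOD threshold from shuffled phenotype-genotype assignments."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    null_max_lods = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any true phenotype-genotype link
        null_max_lods.append(max_lod(shuffled, genotype_table))
    null_max_lods.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return null_max_lods[index]


# Simulated backcross (assumption): 100 animals, 20 unlinked markers,
# and a QTL at hypothetical marker "m5" adding 2 phenotype units.
rng = random.Random(1)
n_animals = 100
genotype_table = {
    f"m{j}": [rng.choice(("BB", "BW")) for _ in range(n_animals)] for j in range(20)
}
phenotypes = [
    rng.gauss(0.0, 1.0) + (2.0 if g == "BW" else 0.0) for g in genotype_table["m5"]
]

threshold = permutation_threshold(phenotypes, genotype_table, n_perm=200)
observed = marker_lod(phenotypes, genotype_table["m5"])
print(f"5% genome-wide threshold: LOD {threshold:.2f}; observed at m5: LOD {observed:.2f}")
```

Because only the maximum LOD of each permuted genome scan is recorded, the resulting threshold already accounts for testing many positions in the same experiment.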

1.5 Validation of a QTL and Functional Testing

The size of a QTL identified in an F2 or backcross usually ranges from 10 to 30 cM, containing several hundred genes that cannot all be evaluated. Above all, QTL mapping represents a statistical approximation of both the existence of a QTL and its genomic location and requires further confirmation before substantial funding and effort are put into gene identification. One approach that can solve both problems is mapping using congenic strains. A congenic strain is an inbred strain in which one part of the genome has been transferred from one strain (donor) to the other (recipient) by repetitive backcrossing to the recipient strain and by selection of animals having the region of interest from the donor strain. Ten generations of backcrossing are used as a standard to remove most of the contaminating donor genome outside the region of interest. The obvious weakness is the time (3–4 years in rats) necessary to develop a congenic strain. The “speed congenic” approach represents marker-assisted breeding in which animals that contain the region of interest are additionally selected for the lowest percentage of contaminating donor genome (21). A simulation analysis demonstrated that 16–20 males screened every 25 cM can create a congenic strain in five generations with a level of contamination equivalent to that of the conventional ten-generation backcrossing strategy. The use of higher marker density and a larger number of males does not influence the rate of removal of the contaminating donor genome. The phenotypic difference between the parental and congenic strains, which differ only in the transferred region, demonstrates the influence of the QTL as identified in linkage analysis. Furthermore, by backcrossing the congenic with the recipient strain and selecting recombinants within the region, correlation of gradually smaller regions with a trait can narrow down the QTL, allowing final gene identification (22). 
Correlation of the region and/or gene with a subphenotype may give insights into mechanisms of gene regulation and trait development. Extensive analysis could be performed repeatedly in genetically identical material that is easy to reproduce.
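The rationale for ten generations of conventional backcrossing can be shown with simple arithmetic. The sketch below assumes unlinked loci and no marker-assisted selection, so each backcross to the recipient halves the expected fraction of the genome that still carries a donor allele. Donor sequence linked to the selected region is removed more slowly and is not modeled, nor is the speed congenic selection step. The function name and the way generations are counted (number of backcrosses, with 0 corresponding to the F1 hybrid) are assumptions of this example.

```python
def expected_donor_fraction(n_backcrosses: int) -> float:
    """Expected fraction of unlinked loci still heterozygous for a donor allele.

    n_backcrosses = 0 corresponds to the F1 hybrid, which is heterozygous
    throughout the genome; each backcross to the recipient halves the value.
    """
    if n_backcrosses < 0:
        raise ValueError("n_backcrosses must be non-negative")
    return 0.5 ** n_backcrosses


for n in (1, 2, 5, 10):
    pct = 100.0 * expected_donor_fraction(n)
    print(f"after {n:>2} backcrosses: {pct:.3f} % of unlinked loci carry a donor allele")
```

Under these assumptions roughly 0.1 % of unlinked loci are expected to retain a donor allele after ten backcrosses, compared with about 3 % after five, which is the gap that marker-assisted selection is intended to close.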

1.6 From QTL to QTN Identification

Although the objective of QTL mapping is to identify genomic regions associated with a trait, the ultimate goal is to identify the gene and the SNP(s) (quantitative trait nucleotide, QTN) or haplotype that is responsible for the trait. By discovering the function of causative SNPs or haplotypes we can understand the molecular changes that lead to the phenotype. The genomic sequences can be exploited to identify QTNs. The limiting factor is genetic resolution, and this type of analysis requires either more a priori information or several generations of intercrossing and a large cohort to test in order to generate sufficient resolution. In the simplest of cases, a candidate gene is already established and the task is to identify the responsible nucleotide. Based on this information, the sequence differences between inbred strains or congenic and parental strains can elucidate the molecular change for the candidate gene to generate a hypothesis for testing. In most cases, more complex cross populations are used to identify QTN, much in the same manner as described above for QTL identification. One strategy that can be used to fine-map a previously identified region of interest is a Partial Advanced Intercross (PAI). The genomic region is first captured in a congenic strain that is then used to breed an intercross that only varies in the congenic region. The population is phenotyped and the SNPs in the variable region are typed to identify QTN by linkage analysis. The approaches described above are used to study a particular gene or genetic region, often based on a previous finding. However, QTN can also be identified on a genome-wide scale. To identify causative variants for genes and pathways associated with a trait, sequence-based and genetic mapping approaches can be combined. Genetic mapping on fully sequenced individuals is transforming our understanding of the relationship between molecular variation and trait. 
The HS has three characteristics that make it particularly well suited for this type of analysis: (1) the genetic resolution enables direct identification of risk genes; (2) the complete genomic sequence of genotyped HS animals can be imputed with high accuracy from the progenitor genomes, and (3) the population’s well-defined haplotype space can be exploited to determine whether genetic association is caused by single variants or by haplotypes (23). This type of analysis requires knowledge about the segregating SNPs in the population. The HS permits a test, called merge analysis, of whether an SNP is responsible for the phenotype, or the combination of variants from a single progenitor, i.e., a haplotype, is causal. The test is possible because the haplotypes segregating in an HS are known (the stock is derived from the eight sequenced genomes). Any imputed variant that exceeds the maximum haplotype logP is termed a candidate causative SNP. The near complete sequence allows identification of multiple QTNs at a locus in addition to haplotypic effects. Defining a catalogue of QTNs enables the study of their effects on protein, mRNA, and gene regulation levels to understand their molecular genetic mechanisms. Manipulation of genes (using various gene targeting approaches) affected by the QTNs enables the study of their effects on the trait.

2 Materials

  1. Experimental population. Inbred strains that will be used to generate experimental mapping population should be selected based on the phenotypic and genotypic differences. Inbred strains are families of animals where all members are genetically identical, or very close to identical. This is achieved by breeding brother and sister pairs for a minimum of 20 generations, which should achieve more than 99% identical genome (24). Extensive information about phenotypic features of rat strains and genetic diversity between them can be found at the Rat Genome Database, RGD (http://rgd.mcw.edu/).

  2. DNA extraction. Materials for DNA extraction will depend on the selected method. Any method or kit that isolates genomic DNA and avoids excessive fragmentation of DNA is suitable (e.g., conventional DNA precipitation with isopropanol or phenol–chloroform extraction).

  3. Phenotyping. Materials for phenotyping depend on the equipment and assays needed for measuring the phenotypic expression of the trait that will be studied.

  4. Genotyping. For the genotyping of microsatellite markers, consumables and equipment that are specific to the preferred PCR reaction or kit and the available capillary sequencer are required. We recommend multiplexing microsatellite markers, which can be done with for example Type-it microsatellite PCR kit (Qiagen). Although there are many machines and software available for fragment analysis, we have good experience with ABI 3730 DNA analyzer and GeneMapper® Software (Applied Biosystems). The genotyping of SNPs is often based on array methodology and thus performed commercially or in specialized laboratories.

  5. Software. We recommend R/qtl software (15). Instructions to install R and the add-on package qtl and perform analysis can be found at http://www.rqtl.org/.

3 Methods

3.1 Strain Selection

Select inbred strains with the largest possible difference in the phenotype of interest and that are genetically diverse. Prior to the breeding, animals should be kept in the new facility for a minimum of 1 week to adjust. Breeding animals should be of an appropriate age, e.g., around 2 months old (rats or mice).

3.2 Experimental Population

Breed a desired experimental population. Backcross or intercross populations are suitable for determination of the number of QTLs and their location in the genome. All genotypes are obtained from the same set of parents, which means that the family structure of the population is known and allows straight and accurate linkage mapping without adjustments to the data, and statistical methods can reliably calculate significance thresholds (25). The mapping resolution in an intercross can be slightly better compared to a BC, because both chromosomes in each pair are allowed to recombine. On the other hand, a backcross is better suited for parent-of-origin analysis. However, QTLs detected in backcross and intercross populations comprise large intervals and often contain hundreds of genes. Higher mapping resolution is required for candidate gene selection and can be achieved in advanced intercross lines and especially in heterogeneous stocks. For example, using the 10th generation AIL (G10) provides an approximate 3.5- to 5-fold increase in resolution compared to an F2 intercross. An HS can provide mapping resolution that is exponentially higher than an AIL, and can thereby combine the gene identification step usually performed in BC/F2 populations with the fine-mapping step usually done in complex populations and congenic strains (8). A disadvantage in using a more complex population is that it does not meet the assumptions of established statistical methods. The statistical approach used to map linkage and establish significance thresholds and confidence intervals must be adjusted for the underlying population stratification.

3.2.1 Backcross (Fig. 1)

N2 backcross (BC) populations are created by backcrossing F1 hybrids to one of the parental inbred strains (26). By using an F1 hybrid as one parent and an inbred strain as the other, it is possible to determine which parent heterozygous alleles were inherited from (27). Additionally, by using F1 hybrid mothers in half of the population and F1 hybrid fathers in the other half (a reciprocal cross), a population is created in which the parental origin of trait-predisposing alleles can be established to determine parent-of-origin effects on inheritance (maternal, paternal, or shared) (2). Furthermore, by creating one reciprocal cross with the susceptible strain and another with the resistant strain, all three possible genotypes are obtained to enable most allelic effects to occur.

Fig. 1

Schematic illustration of the breeding setup used to create a reciprocal backcross population (BC). One pair of autosomes is represented by vertical lines and mitochondria are represented by circles. A BC population is useful for mapping QTLs, while also mapping parent-of-origin effects (mitochondria, sex chromosomes, and QTLs that depend on parental origin). Different breeding setups will allow these factors to vary (mitochondria, sex chromosomes, and maternal/paternal genotype origin), depending on the study aim. In this example, we are backcrossing F1 hybrids to strain B (allowing parental origin of B and W to vary) while keeping mitochondria fixed to B

Example: To create the F1 generation, establish four breeding pairs. For reciprocal breeding, establish two pairs with B female founders (F1B) and two pairs with W female founders (F1W). The reciprocal N2 generation is created from B (n ≥ 4) and W (n ≥ 4) females bred to F1 males and F1 females bred to B (n ≥ 4) and W (n ≥ 4) males. If reciprocal founders are used, a minimum of 16 pairs with F1B and 16 pairs with F1W are set up. Several litters from each pair can be used for experimentation.

3.2.2 Intercross (Fig. 2)

Alternatively, all three genotypes can be directly achieved in a population by mating two F1 hybrid parents. This is called an intercross and the most commonly used generation is the second (F2) (28).

Fig. 2

Schematic illustration of breeding design for intercross (F2) and advanced intercross line (AIL) construction. One pair of autosomes is represented by vertical lines and mitochondria are represented by circles. An intercross captures all three possible genotypes in one population, which is useful for mapping dominant, additive, and recessive QTLs and QTL interactions. To create an F2 population, F1 hybrids are intercrossed. To create an AIL, intercrossing continues for additional generations, avoiding sister-brother mating

Example: Create the F1 generation by the reciprocal breeding described for BC (under Sect. 3.2.1). The F2 generation is created from eight pairs with B female founders (F1B) and W male founders (F1W) and eight pairs with W female founders (F1W) and B male founders (F1B). Several litters from each pair can be used for experimentation.

3.2.3 Advanced Intercross Line (Fig. 2)

An AIL can provide a more precise QTL location and can reduce the interval it spans (3). The population is created much the same as F2 intercross populations, with the crucial difference that the breeding is continued in a pseudo-random fashion for several generations.

Example: Create the F1 and F2 generations described for intercross (under Sect. 3.2.2). The G3 generation is created from breeding couples (n ≥ 50) with both types of female founders (25 pairs each). All subsequent generations are created by random breeding of 50 males and 50 females (n ≥ 50 pairs), avoiding sister-brother mating.
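The gain in resolution from continued intercrossing can be approximated numerically. The sketch below uses a widely cited approximation for advanced intercross lines, attributed to Darvasi and Soller, in which the effective recombination fraction between two loci in generation t is r_t = [1 − (1 − r)^(t−2) (1 − 2r)]/2, where r is the recombination fraction in the F2. The formula assumes random mating in a large population without drift, and the function name and example values are illustrative only.

```python
def ail_recombination_fraction(r: float, generation: int) -> float:
    """Effective recombination fraction between two loci in AIL generation t.

    r is the recombination fraction in the F2 (generation 2). Assumes random
    mating and a population large enough that genetic drift can be ignored.
    """
    if generation < 2:
        raise ValueError("generation must be 2 (F2) or later")
    return 0.5 * (1.0 - (1.0 - r) ** (generation - 2) * (1.0 - 2.0 * r))


# Assumed example: two loci with 1% recombination in the F2.
r_f2 = 0.01
for t in (2, 5, 10, 20):
    r_t = ail_recombination_fraction(r_f2, t)
    print(f"generation {t:>2}: r = {r_t:.4f} (expansion {r_t / r_f2:.2f}-fold)")
```

For closely linked loci the expansion is close to t/2, that is, nearly fivefold in generation 10, which is in line with the increase in resolution quoted for a G10 AIL in Sect. 3.2.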

3.2.4 Heterogeneous Stock (Fig. 3)

The HS was established from eight inbred strains (4). The stock has been bred according to a standard pseudo-random outbreeding schedule for more than 50 generations, using 40 breeding pairs in each generation. The breeding scheme is designed to minimize inbreeding and maximize recombination density to reduce the size of inherited haplotypes (6).

Fig. 3

Schematic illustration of heterogeneous stock (HS) construction. One pair of autosomes is represented by vertical lines. Eight inbred strains are intercrossed for more than 50 generations in a pseudo-random fashion to create genetic mosaics

Example: There are HS colonies already established in both rat and mouse. The rat HS was established by Dr. Carl Hansen at the National Institutes of Health (NIH) in the 1980s from ACI/N, BN/SsN, BUF/N, F344/N, M520/N, MR/N, WKY/N, and WN/N (4). The MR, WN, and WKY strains trace their ancestry to the original Wistar stock, the ACI strain is a hybrid between the August and Copenhagen strains, the BN strain traces its ancestry to the Wistar Institute stock of wild rats, and the M520, F344, and BUF strains are of unknown origin. The European HS colony was established in 2004 by Dr. Alberto Fernandez Teruel at the Autonomous University of Barcelona from animals obtained from the Northwestern University colony (Dr. Eva Redei). The mouse HS was established from A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/2J, DBA/2J, and LP/J (29). The AKR, C57BL, C3H, and BALB strains were originally obtained from Charles River Laboratories (Wilmington, MA) and the A, CBA, DBA, and LP strains were obtained from The Jackson Laboratory (Bar Harbor, ME). The stock was created by Dr. Robert Hitzemann in the 1980s and is currently maintained in his laboratory at the Oregon Health & Science University. The HS colonies are maintained using a circular breeding design, i.e., a male from family one is bred to a female from family two, a male from family two is bred to a female from family three, etc., to keep the genetic heterogeneity while reducing allele fixation. New HS populations can also be created by eight-way intercrossing of inbred strains, but generating new stocks demands a lot of time and resources and we recommend using an existing HS if possible.
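The circular breeding design mentioned above can be written down as a simple pairing rule. In this minimal sketch, families are numbered 1 to n, and the male of each family is paired with the female of the next family. The assumption that the last family wraps around to the first is ours (the text only says "etc."), and the function name is hypothetical; actual colony schedules may rotate the offset between generations.

```python
def circular_breeding_pairs(n_families: int):
    """Return (male_family, female_family) pairs for one generation.

    The male from family i is bred to the female from family i + 1;
    the male from the last family is paired with the female from family 1
    (wrap-around assumed for this illustration).
    """
    if n_families < 2:
        raise ValueError("at least two families are needed")
    return [(i, i % n_families + 1) for i in range(1, n_families + 1)]


# Example with 40 families, matching the 40 breeding pairs per generation
# described for the rat HS.
pairs = circular_breeding_pairs(40)
print(pairs[:3], "...", pairs[-1])
```

Every family contributes exactly one male and one female, and no pair is formed within a family, which avoids sister-brother mating by construction.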

3.3 Phenotyping

Measure your phenotype of interest in the entire experimental population. Measurements should be made as accurately as possible and all factors that can affect the phenotype should be appropriately recorded (e.g., sex, age, set, season, experimenter, reagent batch number, etc.). Collect relevant tissues at the end of the experiment for potential follow-up studies.

3.4 Genotyping

Select genetic markers that cover the genome at appropriate intervals. Marker information (position and primer sequences) can be found at www.ensembl.org and www.rgd.mcw.edu (which also provides information about strain differences). To be informative, a marker must be polymorphic between the parental strains, i.e., have a different number of repeats or a different nucleotide (A, T, C, or G). The effect size of the QTLs being mapped, together with the type and size of the population used, dictates the marker spacing needed to achieve sufficient power to detect the QTLs. In general, marker intervals of 10–25 cM are appropriate for intercross and BC populations, and intervals of approximately 1–5 cM are appropriate for comparable power in the AIL. The HS requires 100 times more markers than a cross between inbred strains to allow haplotype reconstruction (16). Extract genomic DNA from tissue biopsies, e.g., tail tips, ear clips, or any other tissue collected at the end of the experiment (see Note 1). Genotype the markers using your selected assay and protocol (see Note 2).
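As an illustration of marker selection, the sketch below keeps only markers that are polymorphic between the two parental strains and then thins them so that consecutive markers on a chromosome are at least a chosen distance apart. The function name `pick_markers`, the tuple layout, and the example markers are all hypothetical; real marker positions and strain alleles would come from the databases mentioned above.

```python
def pick_markers(candidates, min_gap_cm):
    """Select informative, roughly evenly spaced markers on one chromosome.

    candidates: list of (name, position_cM, allele_strain_B, allele_strain_W).
    A marker is informative only if the two parental alleles differ.
    """
    informative = sorted(
        (c for c in candidates if c[2] != c[3]), key=lambda c: c[1]
    )
    chosen = []
    for name, pos, _allele_b, _allele_w in informative:
        # keep the marker if it is far enough from the last one kept
        if not chosen or pos - chosen[-1][1] >= min_gap_cm:
            chosen.append((name, pos))
    return chosen


# Hypothetical candidate SNP markers on one chromosome
candidates = [
    ("m1", 0.0, "A", "G"),
    ("m2", 4.0, "A", "A"),   # not polymorphic, dropped
    ("m3", 6.0, "C", "T"),   # too close to m1 for a 10 cM spacing
    ("m4", 12.0, "C", "T"),
    ("m5", 25.0, "G", "G"),  # not polymorphic, dropped
    ("m6", 30.0, "A", "T"),
]
print(pick_markers(candidates, min_gap_cm=10.0))
# [('m1', 0.0), ('m4', 12.0), ('m6', 30.0)]
```

The greedy pass guarantees a minimum spacing only; gaps larger than the target interval (here between m4 and m6) still have to be filled by searching for additional polymorphic markers.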

3.4.1 Microsatellites

Genotypes can be determined by PCR amplification of microsatellite markers. Microsatellites are highly polymorphic di-, tri-, or tetra-nucleotide repeats that differ between inbred strains in the number of repeats and thus in the size of the fragment amplified using primers that anneal to the unique DNA sequence flanking the repeat region. Fluorophore-conjugated primers are used, and PCR products are size-fractionated on a capillary sequencer (see Note 3). Genotypes are called using software; however, we recommend manual confirmation of genotypes for quality assurance.

3.4.2 SNPs

SNPs are bi-allelic base pair substitutions that occur at high density (approximately one every 800 bp), which enables ultrahigh-throughput genotyping and the development of dense genetic maps. Select SNP markers based on the sequences of the strains you are using and purchase or design SNP assays for PCR amplification. Allelic discrimination can be performed in the lab using fluorescence-based technology, e.g., TaqMan SNP genotyping assays (Applied Biosystems). More often, SNP genotyping is performed using a custom array (e.g., the Affymetrix RATDIV array for rat HS), either commercially or in specialized laboratories. Briefly, the array interrogates several hundred thousand SNPs chosen based on sequence data for your strains of interest.

3.5 QTL Identification

Once the phenotype and genotype data are compiled for all individuals in the population, the likelihood of existence, location, and significance of QTLs are determined statistically by applying a model to the data. For a quick or preliminary scan of your data for QTLs, we suggest using single-marker tests. This simple method is quick and requires neither special software nor a genetic map. More comprehensive analyses (described in Sections 3.5.2, 3.5.3, and 3.6) might require additional training or statistical and bioinformatics assistance.

3.5.1 Single-Marker Tests (Fig. 4a)

Let us consider an experiment in an F2 cross. At a given marker, divide the animals into three groups according to their genotype (BB, BW, and WW) and compare phenotypes between the groups. Select the appropriate test for your data. ANOVA can be used if the phenotypic values are normally distributed, while nonparametric tests are better suited for phenotypic values that deviate from a normal distribution. A significant difference between genotype groups indicates that the marker is linked to a QTL and warrants more in-depth analysis (described below). Repeat this for every marker to identify all potential QTLs. A threshold for significance has to be established by a more detailed analysis that takes into account the population structure and the number of markers, individuals, and QTLs. For a quick inspection, we would consider everything with p < 0.01 as potentially interesting (pending further follow-up).
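For normally distributed phenotypes, the single-marker test at one marker reduces to a one-way ANOVA across the three genotype groups. The sketch below computes the F statistic with the Python standard library only. The function name `one_way_anova_f` and the phenotype values are made up for illustration, and converting F to a p-value (from the F distribution with the returned degrees of freedom) is left to a statistics package.

```python
def one_way_anova_f(groups):
    """One-way ANOVA across genotype groups at a single marker.

    groups maps a genotype label to the list of phenotype values of the
    animals carrying that genotype. Returns (F, df_between, df_within).
    """
    values = [v for g in groups.values() for v in g]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    ss_between = sum(
        len(g) * (means[label] - grand_mean) ** 2 for label, g in groups.items()
    )
    ss_within = sum(
        (v - means[label]) ** 2 for label, g in groups.items() for v in g
    )
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up phenotype values for one marker in an F2 cross
marker_5 = {"BB": [5.0, 6.0, 7.0], "BW": [3.0, 4.0, 5.0], "WW": [1.0, 2.0, 3.0]}
f_stat, df_b, df_w = one_way_anova_f(marker_5)
print(f_stat, df_b, df_w)  # 12.0 2 6
```

Running the same function over every marker and ranking the results gives the quick preliminary scan described above.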

Fig. 4

Different approaches are used for QTL identification. (a) A single-marker test compares phenotype values between animals grouped according to their genotype (BB, BW, and WW). In this example, animals with genotype BB at marker 5 show higher phenotype values than animals with the BW or WW genotype, indicating that this marker is linked to a QTL. (b) Interval mapping scans for a putative QTL along the genetic map, thus adding information between markers. The genetic map for chromosome 12 is shown on the x-axis and the LOD scores, which measure the strength of evidence, are shown on the y-axis. The QTL is most likely to be located at marker 5 (highest LOD score), with the 95 % confidence interval between marker 3 and marker 6. (c) QTL location and confidence interval are more precisely estimated in populations with higher genetic resolution (higher recombination frequencies). An F2 intercross (red) identifies broad QTLs that contain many genes, while AIL (blue) and HS (black) map narrower QTL intervals

3.5.2 Interval Mapping (Fig. 4b)

This method often entails heavy computations that require specialized software. A variety of software packages can be used; we base our description on R/qtl, which is freely available (15). Interval mapping (IM) requires a genetic map, i.e., chromosomes and locations of markers, either physical, based on the genomic sequence (Mb), or linkage-based, derived from recombination fractions in the population (cM). LOD scores are then generated in an iterative process of associating the phenotype with genomic locations along the map and then re-evaluating linkage considering the newly created information until a QTL is detected. QTLs are more precisely localized by this method, and missing genotypes and errors are accounted for to preserve power, while multiple test corrections decrease the risk of false-positive QTLs. Select the appropriate interval between steps for the analysis. In general, a BC or F2 population rarely has dense enough recombinations to warrant steps smaller than 5 cM, while an AIL has accumulated recombinations and therefore warrants tighter mapping, usually 1–2 cM (depending on the size of the population). A good rule of thumb is that at least 1 % of the population should have recombined between two tested positions, which equals 1 cM in distance. Select the model to be used for QTL analysis. Please see the instructions for your software package regarding the models included. In R/qtl, standard interval mapping can be performed using the em model, while the simplified Haley-Knott regression gives a very good approximation of em for normally distributed data. Other models include nonparametric regression (non-normally distributed data), a binary model (yes/no data), a two-part model (a combination of the binary and normal models for data containing a spike in the phenotype distribution), and imputation, in which missing genotypes are imputed based on surrounding marker genotypes.
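To make the idea of scanning between markers concrete, the following sketch applies a Haley-Knott-style regression to a single marker interval in a backcross. It is a minimal illustration under stated assumptions, not the R/qtl implementation: genotypes are coded 0/1 with no missing data, recombination fractions follow the Haldane map function (no crossover interference), and all function names and data are hypothetical. At each tested position the phenotype is regressed on the probability that the animal is heterozygous given its flanking markers, and the LOD score is (n/2) log10(RSS0/RSS1), where RSS0 and RSS1 are the residual sums of squares without and with the putative QTL.

```python
import math


def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def prob_het(g_left, g_right, d_left_cm, d_right_cm):
    """P(QTL genotype is 1 | flanking marker genotypes) in a backcross.

    d_left_cm and d_right_cm are the distances from the tested position to
    the left and right marker; at least one of them must be greater than 0.
    """
    r_left, r_right = haldane_r(d_left_cm), haldane_r(d_right_cm)

    def transmit(a, b, r):
        return 1.0 - r if a == b else r

    p1 = transmit(g_left, 1, r_left) * transmit(1, g_right, r_right)
    p0 = transmit(g_left, 0, r_left) * transmit(0, g_right, r_right)
    return p1 / (p0 + p1)


def lod_from_regression(x, y):
    """LOD score for the linear regression of phenotype y on predictor x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    rss_null = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0.0 or rss_null == 0.0:
        return 0.0  # nothing to explain
    rss_qtl = rss_null - sxy ** 2 / sxx
    if rss_qtl <= 0.0:
        return float("inf")  # perfect fit
    return (n / 2.0) * math.log10(rss_null / rss_qtl)


def scan_interval(left, right, pheno, interval_cm, step_cm):
    """LOD profile between two flanking markers, tested every step_cm."""
    n_steps = int(interval_cm // step_cm)
    profile = []
    for i in range(n_steps + 1):
        pos = i * step_cm
        x = [prob_het(gl, gr, pos, interval_cm - pos) for gl, gr in zip(left, right)]
        profile.append((pos, lod_from_regression(x, pheno)))
    return profile


# Hypothetical backcross: two markers 20 cM apart, ten animals
left = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
right = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]
pheno = [9.8, 10.1, 10.4, 11.0, 12.1, 11.9, 12.3, 11.2, 10.0, 12.0]
for pos, lod in scan_interval(left, right, pheno, interval_cm=20.0, step_cm=5.0):
    print(f"{pos:5.1f} cM  LOD = {lod:.2f}")
```

At the marker positions themselves the predictor collapses to the observed genotype, so the scan reproduces the single-marker result there and only adds information in between.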
Once you have analyzed your data, select the most appropriate method for setting significance thresholds and confidence intervals. For BC and F2 intercross populations, the standard methods of permutation and bootstrapping can be used. Permutation provides significance thresholds that are specific to the study. Essentially, the phenotypes are randomly shuffled relative to the genotypes before QTL analysis, and the maximum LOD score is recorded for each of a series of such analyses (usually 1,000–2,000, ideally 10,000) to estimate how often a certain LOD score occurs by chance in the population. The conventional significance threshold is the 95th percentile of these maximum LOD scores, but another stringency can be used if desired. To account for the family structure in an AIL, family residual values can be used to calculate significance thresholds. The within-family variance (inheritance of the phenotype with the causative genotype, i.e., linkage) is removed to determine LOD scores for the between-family variance (representing random effects, i.e., no linkage) (14).
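The permutation procedure for BC and F2 populations can be sketched as follows. This is a simplified stand-in for what a QTL package does: the scan statistic here is a plain marker regression LOD on additively coded genotypes (0, 1, 2) rather than full interval mapping, and the function names, the simulated data, and the default of 1,000 permutations are assumptions made for the example.

```python
import math
import random


def marker_lod(genotypes, phenotypes):
    """LOD score from the regression of phenotype on additive genotype code."""
    n = len(phenotypes)
    mean_g, mean_p = sum(genotypes) / n, sum(phenotypes) / n
    sgg = sum((g - mean_g) ** 2 for g in genotypes)
    sgp = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    rss_null = sum((p - mean_p) ** 2 for p in phenotypes)
    if sgg == 0.0 or rss_null == 0.0:
        return 0.0
    rss_qtl = rss_null - sgp ** 2 / sgg
    if rss_qtl <= 0.0:
        return float("inf")
    return (n / 2.0) * math.log10(rss_null / rss_qtl)


def permutation_threshold(markers, phenotypes, n_perm=1000, alpha=0.05, seed=1):
    """Genome-wide LOD threshold from the null distribution of maximum LODs."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    max_lods = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype link
        max_lods.append(max(marker_lod(g, shuffled) for g in markers))
    max_lods.sort()
    # index of the (1 - alpha) quantile, e.g. the 95th percentile
    index = max(0, min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1))
    return max_lods[index]


# Simulated example: 60 animals, 10 unlinked markers, no true QTL
sim = random.Random(42)
markers = [[sim.choice([0, 1, 2]) for _ in range(60)] for _ in range(10)]
phenotypes = [sim.gauss(0.0, 1.0) for _ in range(60)]
threshold = permutation_threshold(markers, phenotypes, n_perm=200)
print(f"5 % genome-wide LOD threshold: {threshold:.2f}")
```

Recording the maximum LOD over all markers in each permutation, rather than the LOD at each marker separately, is what makes the resulting threshold genome-wide and therefore corrected for the number of tests.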

3.5.3 Association Mapping in HS (Fig. 5a)

Genome-wide SNP information can be used for genome-wide association studies (GWAS, or association mapping), provided that the genetic resolution in the population supports such dense analysis. GWAS in the HS involves novel analytical methods and software packages developed specifically for this population (7, 8). To distinguish genotypes in the experimental population, parental haplotypes are reconstructed using a hidden Markov model approach to predict the probabilities of inheritance from each of the eight progenitor strains for each SNP. Haplotypes are constructed for each rat across the genome using the multipoint haplotype reconstruction method HAPPY (http://www.well.ox.ac.uk/happy) implemented in R (16). Association studies in the HS require a more specialized statistical approach because of the population structure, in which individuals have different degrees of genetic relatedness (16). There are two strategies for dealing with relatedness: mixed models, in which the genotypic similarity between individuals is used to model their phenotypic correlation (20), and resampling-based model inclusion probability (RMIP), in which loci that replicate consistently across multiple QTL models fitted on subsamples of the mapping population are identified (30). In both strategies, QTLs are detected by haplotype association (16). Mixed models perform better on normally distributed data (23). The variance components that correct for the genotypic similarity between individuals (pedigree relationships) are estimated using the EMMA package for R (20). The test for association is then performed via a mixed model and expressed as the negative log10 of the p-value (−LogP); covariates can easily be incorporated into the model. The significance threshold is calculated as the −LogP that corresponds to the false discovery rate (FDR) established by permutation for each phenotype, conventionally set to 5 % FDR, although other levels may be used.
Conversely, RMIP performs better for non-normally distributed phenotypes and binary data. In this approach, a “multiple QTL model” is fitted using a model averaging method to obtain a posterior probability that a QTL will be included in the model (7). This is accomplished by repeatedly resampling the data and, in each resample, testing which set of markers best explains the variation in the phenotype. In short, the association between phenotype and genotype at any one locus is corrected for the pattern of associations over the rest of the genome. The QTLs are fitted using R/Bagphenotype. The significance threshold is calculated as the RMIP that corresponds to the false discovery rate (FDR), as described for mixed models.
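The resampling logic behind RMIP can be illustrated with a deliberately simplified sketch. It is not the Bagphenotype implementation: a greedy forward selection on numeric genotype codes stands in for the model averaging over haplotypes, and the function names, the inclusion cut-off `min_r2`, the subsample fraction, and the simulated data are all assumptions. For each subsample of animals, markers are added one at a time to a multiple-marker model as long as they explain enough of the remaining variation; the RMIP of a marker is the proportion of subsamples in which it was included.

```python
import random


def fit_line(x, y):
    """Least-squares fit of y on x; returns (r_squared, residuals)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0, [yi - mean_y for yi in y]
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    residuals = [(yi - mean_y) - slope * (xi - mean_x) for xi, yi in zip(x, y)]
    return 1.0 - sum(e * e for e in residuals) / syy, residuals


def rmip(markers, phenotypes, n_resamples=100, fraction=0.8,
         max_qtl=3, min_r2=0.1, seed=1):
    """Proportion of subsamples in which each marker enters the model."""
    rng = random.Random(seed)
    n = len(phenotypes)
    size = max(2, int(fraction * n))
    counts = [0] * len(markers)
    for _ in range(n_resamples):
        rows = rng.sample(range(n), size)  # subsample without replacement
        y = [phenotypes[i] for i in rows]
        chosen = set()
        for _step in range(max_qtl):
            best_m, best_r2, best_res = None, min_r2, None
            for m, g in enumerate(markers):
                if m in chosen:
                    continue
                r2, res = fit_line([g[i] for i in rows], y)
                if r2 > best_r2:
                    best_m, best_r2, best_res = m, r2, res
            if best_m is None:
                break  # no remaining marker explains enough variation
            chosen.add(best_m)
            y = best_res  # continue on what the chosen markers leave unexplained
        for m in chosen:
            counts[m] += 1
    return [c / n_resamples for c in counts]


# Simulated example: marker 2 (0-based) carries a true effect
sim = random.Random(7)
markers = [[sim.choice([0, 1, 2]) for _ in range(100)] for _ in range(6)]
phenotypes = [2.0 * markers[2][i] + sim.gauss(0.0, 1.0) for i in range(100)]
print(rmip(markers, phenotypes, n_resamples=50))
```

A marker with a genuine effect is selected in nearly every subsample, whereas a marker that is associated only by chance in part of the data is selected inconsistently and receives a low RMIP.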

Fig. 5

The genomic sequence of each HS animal is imputed from the HS founder sequences, based on a selection of genotyped SNPs. (a) The genotypes in the experimental population are distinguished by reconstructing the parental haplotypes to predict the probabilities of inheritance from each of the eight progenitor strains for each SNP. The SNPs and haplotypes are then tested for association. (b) The sequence-based and genetic mapping approaches can be combined to test whether a single SNP or a haplotype (the combination of variants from a single progenitor) is responsible for the phenotype, the so-called merge analysis. Any SNP whose logP exceeds the maximum haplotype logP is a candidate causal functional variant (Causal SNP)

3.6 QTN Identification in HS (if Applicable) (Fig. 5b)

The sequence-based and genetic mapping approaches can be combined to identify causative variants for genes and pathways associated with the phenotypes. If the founder strains are sequenced, the origin of each variant in the population is known. In the HS, this totals approximately 7.2 million SNPs. We can use this catalogue of segregating SNPs to identify genes and causative variants. The HS permits a test, called merge analysis, of whether a single SNP is responsible for the phenotype or whether the combination of variants from a single progenitor, i.e., a haplotype, is causal (23). The test is possible because the haplotypes segregating in an HS are known (the stock is derived from the eight sequenced genomes). Any SNP whose logP exceeds the maximum haplotype logP is a candidate causal functional variant. The near-complete sequence of the HS rats allows us to determine the presence of multiple causal variants at a locus in addition to haplotypic effects. The functional consequences of identified causal variants can then be modeled and experimentally tested based on their position in the genome. For example, the consequences for protein structure of candidate variants lying within coding regions of genes can be predicted by modeling the sequence onto known protein structures to elucidate altered binding affinities and physical interactions. For noncoding causative SNPs, regulatory sequence similarities and changes in transcription factor binding motifs can be modeled. The ultimate functional test entails capturing the sequence alternatives for causative haplotypes or SNPs in a congenic, transgenic, or knock-in animal to test for functional consequences.
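The decision rule of the merge analysis can be written down compactly. The sketch below only illustrates the final comparison step, assuming that logP values for the individual SNPs and for the haplotype test across the QTL interval have already been computed by the mapping software; the function name, SNP identifiers, and values are hypothetical.

```python
def candidate_causal_snps(snp_logp, haplotype_logp):
    """Return SNPs whose logP exceeds the best haplotype logP at the locus.

    snp_logp: dict mapping SNP identifier to its single-SNP logP.
    haplotype_logp: list of haplotype-based logP values across the interval.
    """
    best_haplotype = max(haplotype_logp)
    return sorted(snp for snp, logp in snp_logp.items() if logp > best_haplotype)


# Hypothetical values for one QTL interval
snp_logp = {"snp_a": 4.1, "snp_b": 7.9, "snp_c": 6.2, "snp_d": 8.4}
haplotype_logp = [5.0, 6.8, 7.5, 6.1]
print(candidate_causal_snps(snp_logp, haplotype_logp))  # ['snp_b', 'snp_d']
```

If no SNP exceeds the haplotype signal, the result is an empty list, which is consistent with a haplotypic effect or multiple causal variants rather than a single causal SNP.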

3.7 Validation Using Congenic Strains (Optional Step) (Fig. 6)

We highly recommend QTL validation before proceeding with the often costly and tedious gene and QTN identification. Animal strains that carry isolated genes (positionally cloned or disrupted), i.e., congenic strains, offer a unique opportunity to validate the QTL and elucidate mechanisms underlying gene actions that contribute to the trait. They are similar to inbred strains, with the exception that the genomic region of interest has been transferred from a donor strain (B or W) onto a genetic background of different susceptibility (recipient strain) (see Note 4). A congenic strain can be produced by intercrossing two strains to create F1 hybrids and then backcrossing the F1 to either parental strain (B or W). The genetic recombinations will create unique animals, and the aim is to select one that carries the region of interest from the donor strain. The selected animals are backcrossed to the recipient strain for ten generations to ensure that the genomic background has minimum contamination with fragments of donor DNA. Alternatively, marker-assisted selection can be used (speed congenic), in which the background genome is screened to select the animal with the least contaminating donor genome together with the region of interest, to establish a homozygous congenic strain in 5–7 generations. For a detailed protocol on how to create a speed congenic strain, see Wakeland et al. (21).
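The rationale for ten backcross generations follows from simple Mendelian expectation, which the short sketch below makes explicit. It assumes no selection on the background and considers only loci unlinked to the selected region, so it describes the average animal rather than any individual; the function name is hypothetical. The F1 hybrid carries 50 % donor alleles, and every backcross to the recipient strain halves that proportion on average.

```python
def expected_donor_fraction(n_backcrosses):
    """Expected fraction of donor alleles at loci unlinked to the selected region.

    n_backcrosses = 0 corresponds to the F1 hybrid (50 % donor alleles);
    each backcross to the recipient strain halves the expected fraction.
    """
    if n_backcrosses < 0:
        raise ValueError("number of backcrosses cannot be negative")
    return 0.5 ** (n_backcrosses + 1)


for n in (1, 3, 5, 7, 10):
    print(f"{n:2d} backcrosses: {100.0 * expected_donor_fraction(n):.4f} % donor alleles")
```

After ten backcrosses the expected residual donor contribution is about 0.05 % of the unlinked genome. Marker-assisted selection shortens the process by choosing, in each generation, breeders that happen to carry less donor genome than this average.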

Fig. 6

Schematic illustration of congenic strain construction. One pair of autosomes is represented by vertical lines and mitochondria are represented by circles. Congenic strains are strains that carry isolated genes (positionally cloned or disrupted) from a donor strain (B or W) onto a genetic background of different susceptibility (recipient strain). They are backcrossed to the recipient strain for ten generations to ensure that the genomic background has minimum contamination with fragments of donor DNA. Congenic strains are used to validate the QTL and elucidate mechanisms underlying gene actions that contribute to the trait

4 Notes

1. DNA extracted from ear clips, which are commonly used to mark animals at a young age, can be used to perform genotyping prior to phenotype measurements. This can significantly speed up the experiment, as genotyping can be completed before the end of the experiment.

2. Initially, it is possible to genotype and analyze only animals that display extreme phenotypic values. However, we recommend genotyping all individuals.

3. Different fluorophores allow for multiplexing of microsatellite markers during the PCR step. It is, however, important to plan the multiplexing procedure carefully. The quality of the fluorophore signal in the sequencer will determine the number of different fluorophores that can be used to label primers. Theoretically, markers of the same size can be multiplexed if their primers are labeled with different fluorophores. In our experience, however, a signal from one fluorophore can sometimes be detected in the channel of another fluorophore (leakage). Thus, we recommend using markers of different sizes. Also in theory, markers that differ by approximately 20 bp can be labeled with the same fluorophore and multiplexed. This works well if the markers give only a few bands after PCR amplification. In our experience, the best results are obtained with three fluorophores and up to eight markers that differ as much as possible in size.

4. Ideally, congenic strains should be constructed on both parental backgrounds (B and W). In practice, this is not always possible due to allelic effects, trait architecture (the number of QTLs that affect the trait), QTL strength, etc. Let us assume that we use an F2 cross between W (expressing high phenotype values) and B (expressing low phenotype values) and identify five QTLs. QTL 1 shows a 10 % effect size, and at this locus the W genotype drives low phenotype expression. In this case, constructing a congenic strain that carries the W genotype for QTL 1 (low phenotype expression, 10 %) on the genomic background of the B strain (low phenotype expression) does not give optimal conditions for detecting a phenotype difference. Another factor to consider is cost, and one should carefully select the breeding combinations that are most likely to generate useful tools, i.e., those with a robust phenotype difference between the congenic and parental strains.