Background

Major scientific successes typically build on previous discoveries. This is certainly true of human gene identification studies. Work in model organisms initially delineated the principles behind genetic mapping (Sturtevant 1913), although even in model organisms it took 50 years to generalize the principles to complex quantitative traits (Thoday 1961). Over the same period, statistical geneticists outlined specific approaches needed for gene mapping in human data, describing early ideas leading to allele-sharing statistics (Penrose 1935) and the classical lod score approach (Haldane and Smith 1947; Morton 1955). It then took another 20 years for development of the first practical computational algorithms based on pedigree peeling (Elston and Stewart 1971), with implementation in computer programs (Ott 1974). Another decade passed before an ample source of necessary markers, in the form of DNA-based variation, began to be available (Botstein et al. 1980). Because these fundamental ideas elucidated enduring principles, all remain in use, although the methodologies that implement them, including high-throughput sequencing, are now vastly cheaper, faster, and more efficient.

Once all the pieces were in place, progress toward human gene identification was rapid. The fundamental underlying principles, coupled with an ample source of markers, led to early studies that proved feasibility in humans by identifying genes for both recessive (Riordan et al. 1989) and dominant (Huntington’s Disease Collaborative Research Group 1993) genetic diseases. They also relatively quickly led to early identification of genes affecting moderately complex traits, such as early-onset breast cancer (Miki et al. 1994) and Alzheimer’s disease (Levy-Lahad et al. 1995; Sherrington et al. 1995). The overall strategy was described as positional cloning (Collins 1991), although more recent studies should more properly be referred to as positional sequencing. These early studies illustrated the power of the pedigree-based design in leading to genes of interest with rare but highly influential disease alleles. With this approach, close to 4,500 genes relevant to human disease had been identified by the end of 2011 (Amberger et al. 2011). However, moving from a map location to gene identification was difficult because of the poor resolution of pedigree-based designs (Boehnke 1994), and most success stories represent Mendelian disorders or essentially Mendelian forms of more complex disorders.

Complex traits have yielded less frequently to this approach (McClellan and King 2010; Risch 2000). Struggles with the challenges of complex traits eventually led to an alternative push toward use of population-based designs (Botstein and Risch 2003) under the assumption that much of complex trait genetics might be explained by common genetic variation (Collins et al. 1997) and detectable with association methods. With inexpensive genotyping of very dense SNP panels, genome-wide association studies (GWAS) in large population-based samples became common, with >1,200 publications since 2005 that report at least one association at a genome-wide significance level (http://www.genome.gov/gwastudies). There have, of course, been some clear success stories with identification of the underlying genes, such as complement factor H and macular degeneration (Klein et al. 2005). However, early optimistic predictions about the impact of these discoveries (Manolio et al. 2008) became more guarded for two reasons: (1) most GWAS have only led to associations, with identification and verification of the actual risk variants a rare outcome (Hindorff et al. 2009); and (2) most estimated effects are small, thus explaining relatively little of the estimated genetic variance (Manolio et al. 2009). Given the small estimated effect sizes for most traits, sample sizes have now reached hundreds of thousands of subjects (Speliotes et al. 2010), several orders of magnitude more than those needed for pedigree-based studies. One current hypothesis is that rare variation is much more important than was earlier believed (Manolio et al. 2009). It is also probable that human variation includes all of the complex features found in model organisms: complex genetic architectures, existence of rare alleles, and the presence of many influential variants in non-coding regions (Flint and Mackay 2009) that are therefore not easily amenable to functional prediction algorithms (Erlich et al. 2011).

Investigation of rare variants is the current frontier of human genetic disease. Extensive surveys of existing DNA-based variation are increasingly feasible through use of rapidly improving high-throughput sequencing methods (Ng et al. 2010; Roach et al. 2010). The hypothesis that rare variants are important contributors to complex human traits is further supported by information from (1) early sequencing studies of subjects in the tails of the distribution of cardiovascular-related quantitative traits (Cohen et al. 2004), (2) indications that rare variants are more likely than common variants to have large effects (Bodmer and Bonilla 2008; Gorlov et al. 2011), (3) recent sequencing of regions surrounding seven genes associated with low-density lipoprotein levels (Sanna et al. 2011), and (4) the large numbers of variants typically identified in genes affecting Mendelian disorders, such as familial hypercholesterolemia with >1,100 known variants (Leigh et al. 2008). Sample sizes for GWAS of rare variants would need to increase substantially beyond those already in use. This leads to the inescapable conclusion that designs other than simple population-based designs may be critical as high-throughput sequencing joins the research toolkit in the search for influential rare variation. Use of large pedigrees is one of several important designs in this context and is the focus of this article.

Advantages of family-based designs

Harnessing segregation information

Large pedigrees provide a design that has considerable power in the search for rare trait variants. In this context, large pedigrees are those that are individually large enough to provide a statistically significant result, given good quality and quantity of data and use of an efficient analysis approach. Such pedigrees will typically include at least 20–25 subjects, but will frequently be much larger. The trait loci may represent single genes with relatively high effects, or several closely linked loci, each with moderate effects, that together have a large effect (Yazbek et al. 2011). Large pedigrees intrinsically have more power for detection of linkage or estimation of effects than do equivalent-sized samples of smaller families (Wijsman and Amos 1997) or unrelated subjects, particularly in the presence of rare variants such as those found in sequence data (Gagnon et al. 2011; Simpson et al. 2011; Wilson and Ziegler 2011). If pedigrees are sufficiently large, they can individually implicate genomic regions. For example, some current methods, which are modern versions of early allele-sharing methods, identify shared multilocus segments among small numbers of subjects. These methods rely on very large pedigrees with many meioses separating affected subjects within pedigrees (Leibon et al. 2008; Thomas et al. 2008). They do not require genotyping of intervening relatives, with the pedigree structure providing information about the expected distribution of the number and sizes of such segments. Large pedigrees, therefore, enable gene mapping and identification studies to be carried out with relatively small sample sizes (Wright et al. 1999). The large pedigree design is thus particularly useful in the presence of locus heterogeneity among families. The fundamental disadvantage of pedigree designs is that genomic regions identified via linkage analysis tend to be relatively large because of the coarse nature of the meiotic process (Boehnke 1994). In earlier years, the cost and difficulty of evaluating all variation in an identified genomic region was a barrier to identifying the gene(s) driving the evidence for linkage, particularly in smaller samples, for which the implicated regions were especially large. However, with the recent revolutionary changes provided by new high-throughput sequencing technologies, genotyping costs for follow-up studies in such a region are no longer a major limitation, since the whole region can be evaluated.
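
As a rough illustration of the shared-segment idea (a simplified sketch, not the implementations of Leibon et al. 2008 or Thomas et al. 2008), the following fragment scans the genotypes of a few distantly related affected subjects for long runs of markers at which all of them carry a common allele; in practice, the pedigree structure then supplies the null distribution of the number and length of such runs. The genotype coding and the 5 cM threshold are illustrative assumptions.

```python
# Simplified sketch of shared-segment detection among a few distant affected relatives.
def candidate_shared_segments(genotypes, positions_cm, min_cm=5.0):
    """genotypes: dict subject -> list of (allele1, allele2) per marker, in the same
    order as positions_cm (sorted, in cM). Returns [(start_cm, end_cm)] of runs longer
    than min_cm at which some allele is carried by every subject."""
    subjects = list(genotypes)

    def shared_at(i):
        # True if at least one allele is present in every subject at marker i
        common = set(genotypes[subjects[0]][i])
        for s in subjects[1:]:
            common &= set(genotypes[s][i])
        return bool(common)

    segments, start = [], None
    for i in range(len(positions_cm)):
        if shared_at(i):
            if start is None:
                start = i
        else:
            if start is not None and positions_cm[i - 1] - positions_cm[start] >= min_cm:
                segments.append((positions_cm[start], positions_cm[i - 1]))
            start = None
    if start is not None and positions_cm[-1] - positions_cm[start] >= min_cm:
        segments.append((positions_cm[start], positions_cm[-1]))
    return segments
```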

Evidence for influence of rare variants often includes evidence for co-segregation with traits in pedigrees. This evidence is sometimes presented in an ad hoc manner without quantification of statistical support (Cruchaga et al. 2012). However, co-segregation between the trait and a variant can be evaluated formally by carrying out a classical linkage analysis, using one of the many sophisticated and readily available approaches reviewed elsewhere (Bailey-Wilson and Wilson 2011). A linkage analysis provides a suitable and well-calibrated statistical framework for evaluation of the role of a candidate variant, as demonstrated in an analysis of lipoprotein lipase gene variants and LDL size (Hokanson et al. 1999), or a more recent evaluation of a variant identified through high-throughput sequencing and implicated in Charcot-Marie-Tooth disease (Weedon et al. 2011).
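
To make the scale of such statistical support concrete, the minimal sketch below computes a two-point lod score under the simplifying assumption of phase-known, fully informative meioses; real pedigree data require likelihoods summed over unknown phase and untyped relatives, using software of the kind reviewed by Bailey-Wilson and Wilson (2011).

```python
import math

def two_point_lod(n_recombinant, n_nonrecombinant, theta):
    """lod = log10 L(theta) / L(0.5) for phase-known, fully informative meioses."""
    n = n_recombinant + n_nonrecombinant
    theta = max(theta, 1e-12)  # guard against log(0)
    return (n_recombinant * math.log10(theta)
            + n_nonrecombinant * math.log10(1.0 - theta)
            - n * math.log10(0.5))

# Ten informative meioses with no recombination between variant and trait give the
# classical lod of ~3.0 when maximized over theta (here on a coarse grid):
best_lod = max(two_point_lod(0, 10, t / 100.0) for t in range(0, 51))
```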

Family-based designs enrich for variants of interest. For example, selection of subjects from families with multiple affected subjects, or selection of subjects from families with extreme trait values, can enrich for multiple copies of a rare variant (Gorlov et al. 2011; Ionita-Laza and Ottman 2011), providing improved ability to measure, and detect, the effects of such variants. A contrast of two recent evaluations of several candidate genes for late-onset Alzheimer’s disease provides a useful illustration (Cruchaga et al. 2012; Gerrish et al. 2012). Both studies evaluated genes identified through their role in early-onset AD and dementia, and the same four genes were examined in each. In the family-based study, investigators were able to use a highly efficient pooled-sequencing approach in 439 unrelated probands from late-onset families with multiple affected subjects per family (Cruchaga et al. 2012). This resulted in identification of a statistically significant excess of rare variants in these genes in the probands relative to unrelated controls, with follow-up co-segregation with the trait in the families used as part of the evaluation. In contrast, a population-based case–control sample was ~40 times larger, consisting of 17,313 subjects (Gerrish et al. 2012), but was assembled with no preference for family history. Even a targeted analysis of the four candidate genes produced no evidence for association with SNPs in the genes tested.

Complexities handled

Large pedigrees provide some protection against the highly deleterious effects of heterogeneity. This protection is key to obtaining interpretable results with relatively modest sample sizes, even though genetic heterogeneity is inevitable for complex traits (Sillanpaa and Auranen 2004). Large pedigrees can each be more homogeneous with respect to genetic variation than a combined sample of many smaller families, while providing enough information to detect linkage within individual large families. Using large families for initial screening has long been a highly successful strategy when there is high genetic heterogeneity, as illustrated by traits such as hereditary hearing loss (Varilo and Peltonen 2004) and dilated cardiomyopathy (Hershberger et al. 2010). The strategy has also been both proposed and used in the search for loci contributing to complex traits (Wright et al. 1999). The same strategy worked well to identify loci for simulated trait data as part of Genetic Analysis Workshop 17 (Gagnon et al. 2011; Simpson et al. 2011; Wilson and Ziegler 2011), in which evidence for linkage was clearly identified in the large pedigrees, without a clear comparable signal in sequence data available for unrelated subjects.

Large pedigree data sets frequently include extensive and unusually complete phenotype information. Such large families are typically collected by investigators who have a deep interest in the phenotypes and who develop strong rapport with the subjects. As a result, new phenotypes may be collected on the same participants over time, and may include expensive or unusual phenotypes. Examples include the well-known Framingham Heart Study (Jaquish 2007), the Strong Heart Study (Lee et al. 1990), and large families with familial combined hyperlipidemia under study at several institutions (Jarvik et al. 1994; Pajukanta et al. 2003; Rosenthal et al. 2011). Such samples represent a rich phenotypic resource that can be mined for biological insight and phenotypic subgroups. A rich phenotypic data set can also provide the opportunity to identify pre-clinical phenotypes through evaluation of biomarkers in unaffected relatives of cases with known primary mutations, as is being attempted in both Crohn’s disease (Hedin et al. 2012) and early-onset Alzheimer’s disease (Morris 2011). All of these sources of information can provide a unique and informative perspective on the genotype–phenotype relationship that is impossible to achieve in huge studies that depend on combining measures collected in large numbers across different sites.

Pedigrees facilitate error detection. For example, most sample swaps are not only easy to identify in pedigree genome scan data, but also can be corrected in silico (Boehnke and Cox 1997). Similarly, genotype, and even phenotype, errors are detectable through identification of low probability situations in meiotic transmissions (Buetow 1991; Ehm et al. 1996) with improved detection derived from using more subjects per pedigree (Mukhopadhyay et al. 2004). The higher rate of error associated with high-throughput sequencing means that error detection methods are likely to increase in importance. Similar to in silico correction of pedigree error, it is likely that discrepant sequencing calls can be corrected by capitalizing on information about genomic segments shared identical-by-descent within pedigrees. Software for error detection associated with high-throughput sequencing in large pedigrees has not yet been released. However, proof-of-concept studies show that modern computational tools enable extension of these ideas to error detection with dense data in large pedigrees (Cheung et al. 2011; Markus et al. 2011), so that it is only a matter of time before useable software is available.
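
The simplest such check is Mendelian consistency within parent–offspring trios; the likelihood-based methods cited above go further by also flagging improbable (for example, apparent double-recombinant) transmissions. A minimal sketch of the trio check, with genotypes coded as counts of the alternate allele and with illustrative function names, is:

```python
def gametes(g):
    """Possible transmitted counts (0 or 1 copies of the alternate allele) for a
    parental genotype coded 0/1/2; a missing genotype (None) allows either."""
    return {0: {0}, 1: {0, 1}, 2: {1}}.get(g, {0, 1})

def mendelian_consistent(father, mother, child):
    """True unless the trio genotypes cannot be explained by Mendelian transmission."""
    possible = {p + m for p in gametes(father) for m in gametes(mother)}
    return child is None or child in possible

def flag_inconsistencies(trios, geno):
    """trios: list of (father, mother, child) indices; geno[i][m] is the genotype of
    individual i at marker m. Returns (marker, trio) pairs to examine for error."""
    flagged = []
    for m in range(len(geno[0])):
        for f, mo, c in trios:
            if not mendelian_consistent(geno[f][m], geno[mo][m], geno[c][m]):
                flagged.append((m, (f, mo, c)))
    return flagged
```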

Challenges

Use of large pedigrees also introduces certain complications. Some of these involve requirements for additional information, including the complete pedigree structure, marker allele frequencies, and, for any kind of multipoint analysis, a meiotic map. A sequence-based map is not a particularly good proxy (Wijsman et al. 2007), so invariably some interpolation onto an existing high-quality meiotic map is necessary (Matise et al. 2007). There are also particular challenges with using very dense genotype data, since close markers typically are in linkage disequilibrium (LD). This affects the prior probability distribution of haplotype frequencies, which, in turn, strongly affects analysis results (Ott 1992; Sieh et al. 2007). Attempts to model LD genome-wide in pedigree analysis have so far been unsatisfactory (Thomas 2007). Alternatives are to treat blocks of markers as non-recombining multi-allelic segments with empirically determined haplotype frequencies (Abecasis and Wigginton 2005; Sieh et al. 2007) or to thin markers sufficiently to avoid the problem of LD. The former option has not yet been implemented for large pedigrees, although it works reasonably well for smaller pedigrees. The latter is an option that can be used in most settings, as discussed further below.
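
Interpolation onto an existing meiotic map is typically done linearly in physical position between flanking mapped markers. A minimal sketch, assuming the map is supplied as sorted parallel lists of base-pair and cM positions, is shown below.

```python
import bisect

def interpolate_cm(bp_positions, map_bp, map_cm):
    """Linearly interpolate genetic positions (cM) for markers at bp_positions,
    given an existing meiotic map as parallel sorted lists map_bp, map_cm.
    Positions outside the mapped range are clamped to the map ends."""
    out = []
    for bp in bp_positions:
        j = bisect.bisect_left(map_bp, bp)
        if j == 0:
            out.append(map_cm[0])
        elif j == len(map_bp):
            out.append(map_cm[-1])
        else:
            lo_bp, hi_bp = map_bp[j - 1], map_bp[j]
            lo_cm, hi_cm = map_cm[j - 1], map_cm[j]
            frac = (bp - lo_bp) / (hi_bp - lo_bp)
            out.append(lo_cm + frac * (hi_cm - lo_cm))
    return out
```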

Computation

Computations on large pedigrees are no longer the major bottleneck of the past. While exact computation remains impractical on large pedigrees with many markers (Thompson 2011), there are now excellent alternatives. This makes it possible to include new types of data, such as genotypes from high-throughput sequencing. Practical computation on large pedigrees is based on Markov chain Monte Carlo (MCMC) methods, and implementations have steadily improved since their inception (Sobel and Lange 1993; Thompson 1994). Existing packages, including SIMWALK2 (Sobel et al. 2001), MORGAN (Thompson 2005, 2011; Tong and Thompson 2008), as well as other programs (Thomas et al. 2000), provide excellent results for analysis of multipoint marker data for pedigrees without a large number of loops. These MCMC methods sample from the possible underlying inheritance configurations, with the sampled configurations typically used under a general analytic framework that can incorporate many different linkage statistics (Sobel and Lange 1996). Other MCMC-based approaches permit analysis with complex trait models that would otherwise be computationally intractable (Daw et al. 1999; Heath 1997), allowing use of models better suited to analysis of complex traits.

The accuracy of results obtained with some MCMC-based approaches has been extensively tested for a variety of simulated pedigree structures and marker configurations (Wijsman et al. 2006). For initial computations on pedigrees, the coarse nature of meiosis means that thinning markers to 1 SNP per cM is both acceptable, as measured by little loss of inheritance information despite the reduction in data used, and computationally advantageous (Wijsman et al. 2006). This is consistent with analysis of real data carried out with a variety of methods (Wilcox et al. 2005), which showed that such marker thinning did not cause substantial loss of information, even though some data were then ignored. SIMWALK2 and MORGAN have comparable performance on sparser marker panels, such as panels of multiallelic markers. However, for denser markers, e.g., SNPs or sequence-based markers, the version of MORGAN used at the time of comparison was two orders of magnitude faster than SIMWALK2 (Wijsman et al. 2006). More recent improvements have yielded even better performance (Tong and Thompson 2008).
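
In practice, thinning to roughly 1 SNP per cM can be implemented as a simple greedy pass that keeps one marker per 1 cM window; the sketch below (an illustrative heuristic, not the procedure used in the cited comparisons) prefers the highest minor allele frequency marker within each window.

```python
def thin_markers(cm_positions, maf, spacing_cm=1.0):
    """Keep roughly one marker per spacing_cm, preferring the highest-MAF marker
    in each window (one simple informativeness criterion among several possible).
    cm_positions must be sorted; returns indices of retained markers."""
    kept, i, n = [], 0, len(cm_positions)
    while i < n:
        window_end = cm_positions[i] + spacing_cm
        best, j = i, i
        while j < n and cm_positions[j] < window_end:
            if maf[j] > maf[best]:
                best = j
            j += 1
        kept.append(best)
        i = j
    return kept
```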

One current program that is likely to be particularly useful in the context of modern genotyping data is gl_auto from the MORGAN package. This is the first distributed program that outputs the sampled inheritance patterns so that they can be used for a variety of analyses without requiring the end user to understand and modify the more complex MCMC-based programs. For example, the investigator can take advantage of the factoring of computations associated with the markers versus the trait (Sobel and Lange 1996), or can carry out several different analyses that all depend on the same sample of inheritance patterns, such as analyses of different traits or with different analysis methods. An additional computational cost saving comes from the program IBDgraph (Koepke and Thompson 2010), which recognizes equivalent configurations in the inheritance patterns so that redundant computations can be avoided: a recent application to a moderately large Alzheimer’s disease family sped up computations by a factor of 10 (Marchani and Wijsman 2011).
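
The saving exploited by IBDgraph arises because many sampled inheritance patterns imply the same identity-by-descent partition among the phenotyped subjects at a locus, so the trait likelihood need only be computed once per equivalence class. The sketch below illustrates the grouping idea in a simplified form (it is not the IBDgraph algorithm itself): founder-genome labels are relabeled in order of first appearance so that configurations differing only by a permutation of labels collapse to one key.

```python
from collections import defaultdict

def canonical_ibd_label(fgl_assignment):
    """fgl_assignment: tuple of founder-genome labels, one per observed haplotype,
    at a given locus. Relabel in order of first appearance so that configurations
    differing only by a label permutation map to the same key."""
    seen, canon = {}, []
    for label in fgl_assignment:
        if label not in seen:
            seen[label] = len(seen)
        canon.append(seen[label])
    return tuple(canon)

def group_realizations(sampled_configs):
    """Group MCMC realizations implying the same IBD pattern, so a trait-likelihood
    computation is done once per group and reused for all members."""
    groups = defaultdict(list)
    for idx, cfg in enumerate(sampled_configs):
        groups[canonical_ibd_label(cfg)].append(idx)
    return groups
```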

Information

The large amount of underlying variation creates challenges for use of data generated by high-throughput sequencing. The extent of this rare variation was first revealed by early applications of high-throughput sequencing (Ng et al. 2010; Roach et al. 2010). While samples of unrelated subjects are either severely underpowered or exceedingly expensive because of large required sample sizes, large pedigrees are uniquely suited for study of rare variants. Large pedigrees also allow efficient use of resources: to extract maximal information, sequence data on only a small number of subjects need to be added to a genome scan marker data set, which typically either already exists or can be generated relatively cheaply. The two sources of data can easily be combined and interrogated with available statistical genetic analysis methods. The linkage analysis results can also be used as filters to effectively focus investigations on small segments of the genome (Musunuru et al. 2010; Smith et al. 2011; Wang et al. 2010), or as weights to incorporate results from different sources of information (Roeder et al. 2006). Use of large pedigrees, therefore, can prevent the twin problems of high cost and low power that are intrinsic to population-based studies of rare variants. For example, two recent studies were able to identify causal variants in pedigrees of 24 and 42 subjects by sequencing only 2–4 subjects per pedigree (Musunuru et al. 2010; Wang et al. 2010) and combining this information with existing linkage panel markers.

Early applications of high-throughput sequencing demonstrated its usefulness primarily through application to rare Mendelian disorders. The earliest such applications focused on a handful of mostly unrelated subjects, but were feasible only because they combined a focus on exquisitely rare Mendelian recessive traits with a series of ad hoc but reasonable filters to nominate genes and variants based on predicted function and frequency (Ng et al. 2009, 2010; Roach et al. 2010). This strategy fails in situations that are typical of complex traits, where (1) there is genetic and etiological heterogeneity among subjects, (2) the traits are continuous, (3) the frequency of relevant variants may be greater than vanishingly rare, and (4) the effects of influential variants are unlikely to be easy to predict based on sequence information. In these situations, other analysis approaches and/or filtering strategies are necessary. These are found in the repertoire of existing analytical approaches for pedigree data.

Use of linkage analysis is an effective filtering strategy. Relative to older approaches, positional sequencing with current technologies is much faster, provides much more extensive data on a region, and provides a more comprehensive evaluation of regional variation. As a result, the whole region implicated by a positive linkage signal is now easily sequenced, and it is no longer necessary to accrue a large sample just for the purpose of narrowing a region of interest (Boehnke 1994). Linkage analysis results can therefore be used as a filter to focus on a comprehensive analysis of all or most variants in the region with evidence of linkage (Smith et al. 2011). For traits with a clear mode of inheritance, the region of interest would presumably be that bounded by obligate recombinants. For more complex traits, other criteria may need to be determined. These presumably would include giving greater weight to variants in regions with the strongest evidence of linkage than to those in more peripheral regions, although functional annotation or other biologically based information could also be used, in principle, to prioritize variants. With significant evidence of linkage, it is also only necessary to sequence a small number of carefully chosen subjects per pedigree, preferably with information from the linkage analysis guiding the choice of subjects (Marchani and Wijsman 2011). This strategy has proven efficient for identifying causal mutations for a rapidly increasing number of Mendelian genetic diseases and traits, including some that are found in relatively large pedigrees (Southgate et al. 2011; Weedon et al. 2011), and some that represent exceedingly rare diseases in large pedigrees (Raskind et al. 2009, 2011). Also, by focusing analysis on a limited number of variants defined by a particular region, this strategy does not require the use of imperfect functional prediction tools that may miss key variants (Erlich et al. 2011), and allows interrogation of all or most variants in a region.
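
Operationally, such a filter amounts to restricting attention to variants that fall within the linkage region and, for complex traits, ranking them by the local linkage evidence. A minimal sketch of this bookkeeping, with hypothetical inputs (variant positions in cM and a function returning the lod score at a position), is given below; functional annotations could be folded in as additional weights.

```python
def filter_and_rank_variants(variants, lod_by_cm, region_cm, lod_floor=0.0):
    """variants: list of (variant_id, position_cm) from sequencing the region.
    lod_by_cm: callable mapping a cM position to the lod score from the linkage scan.
    region_cm: (start, end) of the implicated region, e.g., bounded by obligate
    recombinants. Keeps in-region variants and ranks them by local linkage evidence."""
    start, end = region_cm
    kept = [(vid, pos, lod_by_cm(pos)) for vid, pos in variants if start <= pos <= end]
    kept = [entry for entry in kept if entry[2] >= lod_floor]
    return sorted(kept, key=lambda entry: entry[2], reverse=True)
```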

The same general strategy of sequencing small numbers of individuals in key pedigrees also yields genes that affect complex traits. Several studies have reported gene identification in the context of Mendelian traits that exhibit very high levels of genetic heterogeneity. Each study focused on regions with significant linkage evidence in single large pedigrees, including ataxia (Wang et al. 2010), familial dilated cardiomyopathy (Norton et al. 2011), and thoracic aortic aneurysm (Regalado et al. 2011). Some of the earliest successful studies of more complex, multilocus traits also started with one or two large pedigrees, with subsequent focus on genes in region(s) with significant evidence of linkage in those pedigrees. These studies include: (1) a variant affecting plasma adiponectin levels identified as a candidate in a region with linkage evidence (Bowden et al. 2010), (2) two studies that also used quantitative trait linkage analysis and followed this with a comprehensive measured genotype analysis of all variants in the region identified from sequencing (Musunuru et al. 2010; Rosenthal et al. 2011), and (3) one study with a focus on a candidate gene, but with comprehensive sequencing and a measured genotype approach to evaluate all variants in the gene (Calafell et al. 2010). In these latter studies, the measured genotype approach (Almasy and Blangero 2004) was feasible because of the restricted region of interest: in this context, it is computationally feasible to evaluate every variant in a region for its ability to explain the segregating variance. Also, it is important to note that none of these studies required sequencing of large numbers of subjects.
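
The measured genotype approach treats each variant's genotype as a fixed covariate in a model whose residual covariance reflects the pedigree relationships. The sketch below is a minimal maximum-likelihood version assuming numpy/scipy and a known kinship matrix; it is intended only to make the model concrete, since dedicated variance components software implements this far more efficiently and with additional options.

```python
import numpy as np
from scipy.optimize import minimize

def measured_genotype_test(y, g, kinship):
    """Minimal measured-genotype sketch: y = mu + beta*g + polygenic + error, with
    cov(polygenic) = 2*kinship*sigma_g^2 and cov(error) = I*sigma_e^2 (numpy arrays).
    Returns the ML estimate of beta and the likelihood-ratio chi-square (1 df)."""
    n = len(y)
    X_full = np.column_stack([np.ones(n), g])   # intercept + measured genotype
    X_null = np.ones((n, 1))                    # intercept only
    A = 2.0 * kinship                           # expected additive relationship matrix

    def neg_loglik(params, X):
        log_sg2, log_se2 = params[:2]           # log variance components
        beta = params[2:]                       # fixed effects
        V = np.exp(log_sg2) * A + np.exp(log_se2) * np.eye(n)
        resid = y - X @ beta
        _, logdet = np.linalg.slogdet(V)
        return 0.5 * (logdet + resid @ np.linalg.solve(V, resid) + n * np.log(2 * np.pi))

    def fit(X):
        p0 = np.concatenate([[np.log(np.var(y) / 2)] * 2, np.zeros(X.shape[1])])
        res = minimize(neg_loglik, p0, args=(X,), method="Nelder-Mead",
                       options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-6})
        return -res.fun, res.x

    ll_full, est_full = fit(X_full)
    ll_null, _ = fit(X_null)
    beta_hat = est_full[3]                      # coefficient of the measured genotype
    return beta_hat, 2.0 * (ll_full - ll_null)
```

Looping this test over all variants in the linkage region corresponds to the comprehensive measured genotype evaluation described above.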

The measured genotype approach is particularly useful for evaluating effects of variants in large pedigrees, and essentially combines information from both association and linkage. The approach efficiently extracts information about sequence-based variants through two outcomes. First, there is an inherent increase in information obtained by having multiple genotyped copies of a single rare variant in a pedigree. This occurs naturally when a risk variant drives a linkage signal because of co-segregation with the trait. Second, by combining sequence data with existing marker data in large pedigrees, it is possible to impute variants into other individuals who have phenotype data but at most partial genotype data, further increasing the information available for evaluating the effects of variants. Such imputation can be carried out jointly with the linkage analysis (Heath 1997; Rosenthal et al. 2011; Wijsman et al. 2010), but this can be computationally intensive. Alternatively, it is possible to use the existing data on the pedigree to impute variants with a genotype calling algorithm, similar to approaches used for genotype imputation in unrelated subjects (Li et al. 2009). In the context of pedigrees, the shared inheritance of chromosomal segments within a pedigree provides the correlation needed for imputation. Methods that provide imputed genotypes in pedigrees already exist for pedigrees that are small enough for exact computation (Burdick et al. 2006). Similar methods that accommodate much larger pedigrees are also now becoming available (Cheung et al. 2010). These methods sample inheritance vectors (IVs) within pedigrees (Thompson 2011) obtained from analysis with a framework panel of markers. The IVs are obtained with either exact computation or MCMC, e.g., with gl_auto in MORGAN. Deterministic (Wijsman 1987) followed by probabilistic inference then allows computationally rapid imputation of the dense sequence-based variants in the remaining subjects, with a pre-defined, and tunable, error rate. Although it would be inappropriate to use them for linkage analysis, the imputed genotypes may be used in a computationally rapid analysis, such as a variance components analysis (Almasy and Blangero 2004), to determine which variant(s) explain the segregating variance, followed by targeted genotyping of key variants in more subjects.
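
To make the imputation step concrete, the sketch below propagates observed alleles from sequenced subjects to unsequenced relatives through a single sampled inheritance vector, represented here as founder-genome labels for each subject's two haplotypes; averaging over many sampled vectors converts the hard assignments into genotype probabilities. The data representation, and the assumption that observed alleles are ordered consistently with the labels, are illustrative simplifications rather than the gl_auto output format.

```python
def impute_from_ibd(fgl, observed_alleles):
    """fgl: dict subject -> (label_1, label_2), founder-genome labels at the locus
    from one sampled inheritance vector. observed_alleles: dict subject -> (a1, a2)
    for sequenced subjects, with alleles in the same order as the labels.
    Returns dict subject -> imputed (a1, a2), with None where a label was unobserved."""
    # 1. Learn which allele rides on each founder-genome label from sequenced subjects.
    #    (A real implementation would also reconcile or flag conflicting observations,
    #    which is where sequencing-error detection enters.)
    label_allele = {}
    for subj, alleles in observed_alleles.items():
        for label, allele in zip(fgl[subj], alleles):
            label_allele[label] = allele
    # 2. Copy alleles to every subject carrying the same labels.
    return {subj: tuple(label_allele.get(lab) for lab in labels)
            for subj, labels in fgl.items()}
```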

Conclusions

We have argued here that large pedigrees should be among the designs for which high-throughput sequence data are generated, particularly when rare variants are suspected to play a role in disease risk. Large pedigrees are uniquely suited for the study of rare variants, and can provide statistically significant evidence both of co-segregation with the disease or trait and of genotype–phenotype association. Also, a small amount of sequencing added to an otherwise well-characterized large pedigree can yield considerable information even on subjects with no sequence data, thus making this approach highly cost effective. Although the goal here was not to provide a comprehensive evaluation of existing software, appropriate computational tools and statistical methods already exist for analysis of large pedigrees, so there is no major practical barrier to data analysis. In the search for the genetic basis of complex traits, it is important to use the best available tools, and not to eschew well-established designs and analysis methods for lack of novelty. This will enable the information in the sequence data to be efficiently obtained and used by drawing upon existing methods with well-understood properties. As a result, the quality of the scientific inference obtained from the use of novel genotyping technologies will be maximized.