Introduction

The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a nationally representative longitudinal study of more than 20,000 adolescents originally sampled in Grades 7–12 in the United States between 1994 and 1995. Add Health respondents have been followed through adolescence and into early adulthood with four in-home interviews (1995, 1996, 2001–2002 and 2008–2009). The Add Health design included the oversampling of approximately 3,000 pairs of individuals who were raised in the same household. These pairs are biologically related to varying degrees and include monozygotic (MZ) and dizygotic (DZ) twins, full siblings, half siblings and unrelated individuals. For further details on the study design and sampling scheme for the Add Health Sibling Pairs Sample, including phenotypic, environmental and biological assessments, see Harris et al. (2013). During the fourth in-home visit (Wave IV; 2008–2009), Add Health collected saliva from the entire sample of Add Health respondents (N = 15,701), including the sibling pairs subsample. The rate of consent to provide saliva for DNA extraction at Wave IV was 96 % among the sibling pairs subsample, similar to the consent rate for the entire Add Health sample. See Harris et al. (2013) for additional information on the Add Health Study design and genetic data.

Genome-wide association studies have largely been conducted using case–control and/or cross-sectional study designs primarily due to efficiency and ease of collection. The integration of genome-wide data into well-characterized longitudinal and prospective cohort studies that include biological relationships such as the Add Health sibling pair subsample has been much more limited. Notable exceptions include the Framingham Heart Study [NHLBI SNP-Health Association Resource (SHARe)] that follows multi-generational samples prospectively, and the Health and Retirement Study (HRS) that surveys a representative sample of individuals over the age of 50 every 2 years and follows them prospectively. Utilizing genetic data from longitudinal and prospective cohort studies has many potential advantages including refinement of phenotypic endpoints, phenotypic change and trajectory. Within the context of the ethnically diverse Add Health sibling pairs subsample of adolescents and young adults, there are additional advantages including family-based assessment and utilizing measured environmental and social factors collected over time.

Here, we provide a description of the genome-wide data that were generated on the Add Health sibling pairs subsample. In particular, we focus on describing the targeted sample for genotyping, the quality control (QC) and quality assurance (QA) steps that were taken and how putative biological relationships were assessed. Using genome-wide data from known reference populations, we also show the genetic ancestry of the Add Health sibling pairs subsample. In addition, we explore the genetic heritability of body mass index (BMI) using the genome-wide data from the Add Health sibling pairs subsample. Finally, rather than conducting a traditional cross-sectional genome-wide association study (GWAS) of BMI, we use the rich, publicly available genome-wide information to conduct a weighted GWAS of longitudinal BMI while accounting for both family and ethnic variation. Funding for the genotyping of the sibling pairs subsample was provided by the National Institute of Child Health and Human Development (R01 HD060726).

Materials and methods

Quality control and quality assurance

The QA/QC Report for the Add Health Sibling Pairs Sample is provided in the Supplemental Materials. Briefly, there we describe how the sample was selected, prepared and genotyped, the number of markers removed, the number of samples removed, sex checks and duplicate concordance. The number of individual samples deemed of high quality for subsequent relationship testing, ancestry estimation and genome-wide analysis is N = 1,888. The number of SNP markers (chromosomes 1–22 and X) with a genotyping call rate of at least 95 % is N = 940,862.

Computer software

For biological relationship testing, PLINK (Purcell et al. 2007) and Kinship-based Inference for GWAS (KING; Manichaikul et al. 2010) were used. For genetic ancestry estimation, we used KING (Manichaikul et al. 2010) and ADMIXTURE (Alexander et al. 2009). R (R Core Team 2013) was used for graphical display of ancestry information. For the estimation of heritability using genome-wide data, genome-wide complex trait analysis (GCTA; Yang et al. 2011) was used. For the GWAS we used SAS 9.3 (SAS Institute, Cary, NC, USA) and R. Once again, R was used for graphical display of the genome-wide association results.

Estimation of genetic relatedness

Using information from chromosomes 1–22 (919,509 SNP markers) on the clean set of 1,888 individual samples, we estimated identity by state (IBS) and identity by descent (IBD) using PLINK (Purcell et al. 2007) as well as the Kinship Coefficient using KING (Manichaikul et al. 2010). These measures are used to test duplicate concordance, confirm expected biological relationships, identify unknown or cryptic relatedness in the sample and provide the information necessary to assess genetic ancestry. The relationship measures are calculated pairwise for all individuals in the dataset. As generally recommended, we pruned autosomal SNPs to establish an approximately independent set of SNP markers to be used for IBS, IBD and Kinship Coefficient estimation. We used a linkage disequilibrium threshold (r²) of 0.20 with a window size of 50 SNPs and a shift of 5 SNPs at each step (PLINK command: --indep-pairwise 50 5 0.20). After pruning, a set of 231,649 autosomal SNP markers in approximate linkage equilibrium was used to estimate the relationship measures. Pairwise mean IBD was estimated using PLINK (“PI_HAT”). However, PLINK’s estimates of IBD may be biased in stratified (multi-ethnic) samples (Manichaikul et al. 2010; Thornton et al. 2012). Therefore, we relied upon the KING package to provide estimates of relationship (Kinship) that are robust to stratification.
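The windowed pruning step can be illustrated with a minimal Python sketch. This is a simplified, greedy stand-in for PLINK's procedure, not a reimplementation of it: the function names `r_squared` and `prune` are hypothetical, genotypes are assumed to be coded as 0/1/2 allele counts with no missing values, and the later SNP of a correlated pair is always dropped (PLINK applies its own rule for choosing which SNP to remove).

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)


def prune(genotypes, window=50, step=5, threshold=0.20):
    """Greedy windowed pruning; returns the indices of the SNPs that are kept.

    genotypes: list of SNPs in genomic order, each a list of per-individual
    allele counts. Simplification (assumption): the later SNP of any pair
    with r^2 above the threshold is dropped.
    """
    n = len(genotypes)
    keep = [True] * n
    start = 0
    while start < n:
        idx = [i for i in range(start, min(start + window, n)) if keep[i]]
        for pos, i in enumerate(idx):
            if not keep[i]:
                continue
            for j in idx[pos + 1:]:
                if keep[j] and r_squared(genotypes[i], genotypes[j]) > threshold:
                    keep[j] = False
        start += step  # slide the window forward by `step` SNPs
    return [i for i in range(n) if keep[i]]


# Toy data: SNP 1 duplicates SNP 0; SNP 2 is uncorrelated with SNP 0.
toy = [[0, 1, 2, 1, 0, 2], [0, 1, 2, 1, 0, 2], [1, 0, 1, 2, 1, 1]]
print(prune(toy))  # -> [0, 2]
```

The default arguments mirror the 50-SNP window, 5-SNP shift and r² threshold of 0.20 described above.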

Estimation of genetic ancestry

We explored genetic ancestry in two different ways. Note that the sample of N = 1,888 individuals with clean genotypes includes two MZ twin pairs. For the purposes of estimating genetic ancestry, we removed one individual (randomly) from each of the two MZ twin pairs resulting in a final analysis sample of N = 1,886. For our first approach to estimating genetic ancestry, we used KING (Manichaikul et al. 2010) to identify clusters of individuals based upon genetic similarity. KING uses multidimensional scaling (MDS) with Euclidean distance to generate principal coordinates (PCs) that can be used to identify population substructure. For the KING procedure, we used the same set of 231,649 autosomal SNP markers in approximate linkage equilibrium that was used for the estimation of genetic relatedness.

Second, we explored genetic ancestry using the software package, ADMIXTURE (Alexander et al. 2009). ADMIXTURE uses an efficient likelihood model-based estimation of genetic ancestry using genome-wide data. For the ADMIXTURE procedure, we opted for a supervised analysis utilizing a series of known genetic ancestry populations as fixed groups to estimate the proportion of ancestry that individuals from the Add Health sibling pairs subsample share with each ancestral reference population. The ancestral populations used were derived from the Human Genome Diversity Project (HGDP; Li et al. 2008) and International Haplotype Map Project (HapMap; International HapMap 3 Consortium, 2010). Specifically, we utilized 108 samples from the HGDP to represent the Americas (Surui, Maya, Karitiana, Pima and Colombian), and 402 samples from HapMap to represent Europe (CEU), Africa (YRI), China (CHB) and Japan (JPT). In all, we identified 257,035 SNP markers that overlap across the Add Health sibling pairs subsample, the HGDP sample and the HapMap sample. For efficiency using the program ADMIXTURE, we created an autosomal SNP marker set that was in approximate linkage equilibrium (123,198 SNPs) to estimate ancestry.

GCTA heritability of BMI

We used the GCTA software (Yang et al. 2011) to estimate the heritability of BMI as measured in Add Health as part of the Wave II, Wave III and Wave IV data collection. BMI was calculated for each respondent using the standard formula of mass (kg) divided by height (m) squared (kg/m²). GCTA works by first estimating the genetic relatedness between all possible pairs of individuals. The genetic relatedness measures are known to be sensitive to population stratification, so for this application, we restricted the analysis to white respondents only. The subsequent step in the GCTA process is the estimation of a random effects model, where the random effects have a covariance structure based on the estimated genetic relatedness values. The percentage of total variance associated with the genetic random effects is considered the estimated heritability. For this particular study, we removed all pair-wise relationship measures above 0.025.
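The two quantities defined in this section, BMI and the variance ratio interpreted as heritability, can be written out as a short Python sketch. The function names and the numeric inputs are hypothetical and chosen only for illustration; the variance components themselves would come from fitting the random effects model in GCTA, which is not reproduced here.

```python
def bmi(mass_kg, height_m):
    """Body mass index in kg/m^2: mass divided by height squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return mass_kg / height_m ** 2


def heritability(var_genetic, var_residual):
    """Share of total phenotypic variance attributed to the genetic random effect."""
    total = var_genetic + var_residual
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_genetic / total


# Illustrative inputs only (not Add Health values).
print(round(bmi(70.0, 1.75), 2))         # -> 22.86
print(round(heritability(8.0, 2.0), 2))  # -> 0.8
```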

Genome-wide association approach

To conduct SNP-by-SNP genome-wide association analysis of BMI, we started with 919,509 autosomal markers with a genotyping call rate greater than 95 %. We then removed SNP markers that showed evidence of deviation from Hardy–Weinberg equilibrium (HWE) in 492 unrelated, self-identified white individuals extracted from the entire sample. These 492 individuals were selected via a two-step process: we first restricted to the homogeneous self-identified white sample and then randomly selected one individual from each biological relationship pair. In all, 6,237 autosomal SNPs were flagged for potential deviation from HWE (p < 0.001) and removed from the genome-wide association analysis. We further removed SNP markers with a minor allele frequency (MAF) < 0.01 (32,313 SNPs). Therefore, the final genome-wide association marker set includes 880,959 autosomal SNPs. As noted previously, the sample of N = 1,888 individuals with clean genotypes includes two MZ twin pairs. For the purposes of the genome-wide association analysis, we removed one individual (randomly) from each of the two MZ twin pairs, resulting in a final analysis sample of N = 1,886.
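The marker filters above can be sketched in Python from genotype counts. This is an illustration under stated assumptions: the function names are hypothetical, and the HWE check uses a 1-degree-of-freedom chi-square goodness-of-fit test as a simple stand-in, since the text does not state which HWE test was used. The last lines reproduce the marker bookkeeping reported in the text.

```python
import math


def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test of Hardy-Weinberg proportions.

    Assumption: a chi-square test is used here for simplicity; it is not
    necessarily the test applied in the study.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic marker: no deviation can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_filters(n_aa, n_ab, n_bb, hwe_alpha=0.001, maf_min=0.01):
    """Keep a marker unless HWE p < hwe_alpha or MAF < maf_min."""
    return (hwe_chisq_pvalue(n_aa, n_ab, n_bb) >= hwe_alpha
            and maf(n_aa, n_ab, n_bb) >= maf_min)


# Marker bookkeeping as reported in the text.
remaining = 919_509 - 6_237 - 32_313
print(remaining)  # -> 880959
```

In the study the HWE check was run in the 492 unrelated white individuals only, so the genotype counts passed to the HWE function would come from that subset.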

To optimize statistical power, rather than conduct a traditional family-based association analysis on the related sets of individuals, we opted for a more flexible linear mixed effects model (Bates et al. 2014). This approach allowed us to model longitudinal measures of BMI (Waves II, III and IV) from all 1,886 individuals while accounting for biological relationships (if present) and within-individual variation in BMI as well as controlling for age, sex and MDS-derived components of ancestry. BMI measures from women who were pregnant were excluded from this analysis. Note that only five respondents did not have BMI measures across all three waves.
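One plausible way to write such a model is given below. This is a sketch under assumptions rather than the exact specification fitted in the study: the indices f (family), i (individual) and t (wave), the nested random intercepts u and v, and the coefficient names are all introduced here for illustration.

```latex
\[
\mathrm{BMI}_{fit} = \beta_0 + \beta_G\, G_{fi} + \beta_A\, \mathrm{age}_{fit}
  + \beta_S\, \mathrm{sex}_{fi} + \sum_{k} \gamma_k\, \mathrm{PC}_{fik}
  + u_f + v_{fi} + \varepsilon_{fit},
\]
\[
u_f \sim N(0, \sigma_u^2), \qquad
v_{fi} \sim N(0, \sigma_v^2), \qquad
\varepsilon_{fit} \sim N(0, \sigma_\varepsilon^2).
\]
```

Here G is the allele count at the SNP being tested, PC denotes the MDS-derived ancestry components, u is a family-level intercept shared by biological relatives and v is an individual-level intercept capturing the correlation among repeated BMI measures. Under this form, the SNP association test is a test of whether the coefficient on G equals zero.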

Weighted association

A major issue in genome-wide studies is the multiple testing that arises from testing hundreds of thousands (if not millions) of SNP markers for association with the disease or trait of interest. In response, many investigators have advocated the use of a Bonferroni correction to limit the probability of committing type-I errors. However, this comes at the cost of increasing the probability of committing type-II errors, thereby diminishing the opportunity to detect true association signals. This is particularly true of smaller genome-wide association datasets such as the sibling pairs sample. One solution is to incorporate prior information into the association scan. In this study, we use a weighted association approach as implemented by Roeder et al. (2006) to accomplish this. While there are a variety of ways to construct weights, only two criteria must be met: first, each weight must be greater than 0, and second, the mean of the weights must be 1. There are numerous sources of prior information that can motivate the weighting scheme, including linkage scans, bioinformatics information and signals from previously conducted (and independent) genome-wide association studies (Roeder et al. 2007; Roeder and Wasserman 2009). Further, the prior information can be in the form of test statistics (e.g. LOD scores, Z scores) or p-values (Roeder et al. 2006).
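The two weight criteria, and the way weights interact with a Bonferroni-style threshold, can be sketched in Python. The function names and the toy p-values are hypothetical. The decision rule shown (reject when p divided by its weight is at most alpha divided by the number of tests) is the standard weighted Bonferroni rule and is given for illustration only; it is not the procedure inside the R script used in this study.

```python
def check_weights(weights, tol=1e-9):
    """Verify the two criteria: every weight > 0 and the mean weight equals 1."""
    if any(w <= 0 for w in weights):
        raise ValueError("every weight must be greater than 0")
    if abs(sum(weights) / len(weights) - 1.0) > tol:
        raise ValueError("the weights must average to 1")


def weighted_bonferroni(pvalues, weights, alpha=0.05):
    """Indices of tests with p_i / w_i <= alpha / m (weighted Bonferroni rule)."""
    check_weights(weights)
    m = len(pvalues)
    return [i for i, (p, w) in enumerate(zip(pvalues, weights))
            if p / w <= alpha / m]


# Toy example: four tests, one of them strongly up-weighted by prior evidence.
pvals = [0.004, 0.02, 0.5, 0.9]
print(weighted_bonferroni(pvals, [1.0, 1.0, 1.0, 1.0]))  # -> [0]
print(weighted_bonferroni(pvals, [0.4, 2.8, 0.4, 0.4]))  # -> [0, 1]
```

In the toy example the second test misses the unweighted threshold but passes once its prior weight is applied, while the first test is strong enough to survive being down-weighted.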

The weights for this study were derived from the GWAS of BMI conducted by the Genetic Investigation of Anthropometric Traits (GIANT) consortium (Speliotes et al. 2010). Details on the sample and the analysis procedures can be found elsewhere (Speliotes et al. 2010). Briefly, the GIANT consortium conducted a GWAS of BMI using 249,796 individuals and made the association signals for each of the ~2.8 M SNP markers available to the public. In particular, the p-values from the GIANT GWAS served as the prior information used to devise the weighting scheme for the genome-wide association scan in this study. In their original description of this approach, Roeder et al. (2006) introduced exponential and cumulative weighting procedures. We opted for the cumulative weighting scheme, which can be less sensitive to large prior association signals, and we used a scaling factor (B) of 2 (Roeder et al. 2006).
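A cumulative weighting scheme can be sketched in Python as follows. Two assumptions are made and should be checked against Roeder et al. (2006) and the R script before reuse: each prior two-sided p-value is converted to an absolute Z score, and the raw weight is taken as the standard normal cumulative distribution function evaluated at Z minus B, rescaled so that the weights average to 1. The function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def prior_z(p_two_sided):
    """Absolute Z score implied by a two-sided prior p-value in (0, 1]."""
    return -_STD_NORMAL.inv_cdf(p_two_sided / 2)


def cumulative_weights(prior_pvalues, B=2.0):
    """Weights proportional to Phi(Z - B), rescaled to have mean 1.

    Assumption: this is one reading of the cumulative scheme, not a
    verified copy of the weighted_FDR.R implementation.
    """
    raw = [_STD_NORMAL.cdf(prior_z(p) - B) for p in prior_pvalues]
    scale = sum(raw) / len(raw)
    return [r / scale for r in raw]


# Toy prior p-values: one very strong signal, one moderate, two null-like.
weights = cumulative_weights([1e-10, 0.01, 0.5, 0.9])
print([round(w, 3) for w in weights])
```

Because the raw weight is bounded above by 1 before rescaling, an extremely small prior p-value cannot dominate the weights, which is consistent with the cumulative scheme being less sensitive to large prior signals.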

We focused on markers that either overlapped between the GIANT consortium and this study or were GIANT consortium markers in reasonable linkage disequilibrium (r² > 0.80) with SNP markers from this study. In all, we identified 717,411 markers to be tested for association that also have corresponding weights from the GIANT consortium. As a result, p-values from this study may be up-weighted or down-weighted depending upon the association signal from the GIANT GWAS. More specifically, the unweighted (nominal) p-values from this study are divided by the weights assigned through the GIANT GWAS to generate the weighted p-values. The weighting procedure was conducted in R using the “weighted_FDR.R” script that can be found at http://www.wpic.pitt.edu/wpiccompgen/fdr/.

Results

Biological relationships

Table 1 uses the Kinship Coefficient generated from KING to tabulate the observed pairwise relationship status based upon genome-wide data versus the expected pairwise relationships based upon information from Add Health (using the Add Health variable, “sibcl4”). The sibcl4 variable is one of the classification variables available for the Add Health sibling pairs subsample. This particular classification designates pairs of respondents as monozygotic twin pair (MZ), dizygotic twin pair (DZ), full sibling pair (FS), half-sibling pair (HS), cousin pair (CO), unrelated pair (UN) or undetermined relationship (UD). A total of 1,781,328 (1,888 choose 2) pairwise relationship comparisons were conducted. As recommended by the authors of KING (Manichaikul et al. 2010), a Kinship Coefficient greater than 0.354 is categorized as an MZ twin pair (duplicates have been removed), between 0.177 and 0.354 as a 1st degree relationship (DZ twin pairs and full sibling pairs; no parent-offspring pairs are present), between 0.0884 and 0.177 as a 2nd degree relationship (half-sibling pair and avuncular), between 0.0442 and 0.0884 as a 3rd degree relationship (half-avuncular and first cousin) and less than 0.0442 as not related (NR). Note that the expected values for the Kinship Coefficient are 0.50, 0.25, 0.125, 0.0625 and 0.0 for MZ, DZ/FS, HS, CO and UN, respectively, and that the boundaries suggested by the authors of KING account for the variability in the Kinship Coefficient estimated from genome-wide data. As can be seen in Table 1, the vast majority of expected relationships are consistent with the observed genetic relationships. However, there are notable discrepancies. For example, there are 33 expected full sibling pairs that are consistent with being half-sibling pairs according to the genetic data. Further, there are two MZ twin pairs: one was thought to be a DZ twin pair, while the relationship of the other was unknown prior to this study.
Additionally, a pair of individuals who were thought to be an MZ twin pair is likely a DZ twin pair. As can also be seen in the table, we have detected 266 pairs of individuals thought to be unrelated who are at least distantly related. The majority of these pairs (264) are 3rd degree relationships (e.g. cousins). In total, there are 664 full sibling/DZ twin pairs (1,328 individuals) that could be utilized for studies employing a traditional sibling pair family-based design. Based upon these relatedness measures, we have created a new variable called “sibclg” that specifies the biological relationship based upon the genome-wide information as presented here. This variable will be made available to Add Health researchers through the Restricted-Use Data mechanism of Add Health and will be included with other variables related to the sibling pair data.
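The classification rule based on the Kinship Coefficient can be expressed as a short Python sketch. The function name and category labels are hypothetical, and the handling of values that fall exactly on a boundary is an assumption, since the text does not say which side a boundary value belongs to.

```python
import math


def classify_kinship(phi):
    """Map an estimated Kinship Coefficient to a relationship category.

    Cut-offs are the KING boundaries quoted in the text. Assumption: a value
    exactly on a boundary is assigned to the more distant category.
    """
    if phi > 0.354:
        return "MZ"
    if phi > 0.177:
        return "1st degree"
    if phi > 0.0884:
        return "2nd degree"
    if phi > 0.0442:
        return "3rd degree"
    return "NR"


# Expected coefficients for MZ, DZ/FS, HS, CO and UN pairs.
for phi in (0.50, 0.25, 0.125, 0.0625, 0.0):
    print(phi, classify_kinship(phi))

# Number of distinct pairs among 1,888 individuals.
print(math.comb(1888, 2))  # -> 1781328
```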

Table 1 Observed versus expected relationship status

Self-reported ethnicity and multidimensional scaling (MDS)

We used the Add Health “ah_race” variable for self-reported ethnicity. This variable includes five categories: White, Black, Native American, Asian and Hispanic. Add Health respondents who self-reported as Hispanic were included in the Hispanic category irrespective of whether they also self-reported as White, Black, etc. Of the 1,886 individuals included in this analysis, 917 self-identify as White, 677 as Black, 209 as Hispanic, 73 as Asian and 8 as Native American; 2 individuals have unknown self-identified ethnicity (see Table 2). Supplemental figure 3 shows the first 5 principal coordinate (PC) MDS estimates from KING, color-coded by self-identified ancestry. As can be seen in supplemental figure 3A, the first principal coordinate (PC1) distinguishes European ancestry from African ancestry, while the second principal coordinate (PC2) captures variation between European, Asian and, to an extent, Hispanic ancestry. In supplemental figure 3B we see variation among Hispanic samples, including a small set of self-identified Native American individuals. Supplemental figures 3C and 3D further distinguish between and within self-identified ethnic groups.

Table 2 Self-identified race and proportion of genetic ancestry

Self-reported ethnicity and genetic ancestry

Table 2 and supplemental figure 4 display the proportion of ancestry that each individual from the Add Health sibling pairs sample shares with each of the reference populations. As can be seen in Table 2 and supplemental figure 4A, the vast majority of individuals who self-identify as White have predominantly European ancestry (CEU). Supplemental figure 4B illustrates the admixed ancestry typical of African Americans among individuals self-identifying as Black. Self-identified Black individuals have varying degrees of African (YRI), European (CEU) and American (AMR) ancestry, as can be seen in Table 2. Likewise, self-identified Hispanic and Native American individuals (supplemental figures 4C and 4D) display an admixed ancestry composed largely of American (AMR), European (CEU) and African (YRI) ancestry and, to a lesser extent, Chinese (CHB) and Japanese (JPT) ancestry. Finally, supplemental figure 4E includes self-identified Asians. As can also be seen in Table 2, the ancestry of this subgroup is primarily of Chinese origin (CHB) but with measurable admixture of Japanese (JPT) and European (CEU) ancestry.

GCTA heritability and weighted genome-wide association analysis of BMI

The GCTA heritability of Wave II, Wave III and Wave IV BMI based upon a sample of white respondents was estimated to be 0.82 (SE = 0.081), 0.71 (SE = 0.091) and 0.67 (SE = 0.084), respectively. Using a bivariate approach, the GCTA genetic correlations between Waves II/III, Waves II/IV and Waves III/IV were estimated to be 0.95 (SE = 0.031), 0.85 (SE = 0.042) and 0.99 (SE = 0.033), respectively. The degree of relationship among the 1,886 individuals used in the genome-wide association analysis varies. In all, there are 614 individuals who are not 1st degree relatives (siblings) of others in the sample, 609 sibling pairs (N = 1,218 individuals) and 18 sibling trios (N = 54 individuals). Tables 2 and 3 provide the characteristics of the sample used for the genome-wide analysis. As noted previously and seen in Table 2, approximately half of the sample self-identifies as White, 36 % as Black, 11 % as Hispanic, 4 % as Asian and less than 1 % as Native American. Table 3 shows the frequency of males (48 %) and females (52 %) as well as the mean age and BMI of the sample at each wave of collection. Consistent with other studies, the mean (and standard deviation) of BMI increases throughout young adulthood.
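The sample composition reported here can be cross-checked with a few lines of Python. The counts are taken directly from the text; the variable names are hypothetical.

```python
# Self-identified ethnicity counts reported for the analysis sample.
counts = {
    "White": 917,
    "Black": 677,
    "Hispanic": 209,
    "Asian": 73,
    "Native American": 8,
    "Unknown": 2,
}
total = sum(counts.values())
percent = {group: round(100 * n / total, 1) for group, n in counts.items()}

# Family structure: singletons, sibling pairs and sibling trios.
individuals = 614 * 1 + 609 * 2 + 18 * 3

print(total, individuals)  # -> 1886 1886
print(percent)
```

Both tallies return 1,886 individuals, and the rounded percentages agree with the approximate shares quoted above.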

Table 3 Characteristics of the genome-wide association sample

To assess the presence of systematic biases in the genome-wide analyses, we generated a quantile–quantile (Q–Q) plot of the unweighted p-values (supplemental figure 5). As can be seen in the Q–Q plot, there is no evidence of widespread bias generating the associations. To illustrate the distribution of the weights derived from the GIANT consortium, we provide a simple histogram (Fig. 1). As can be seen in Fig. 1, the vast majority of the p-values genome-wide are effectively down-weighted (weight < 1; gray bar, N = 522,093) while a fair number of p-values are substantially up-weighted (weight > 1; colored bars, N = 195,318). These weights were applied to each of the nominal p-values generated via the linear mixed effects model of longitudinal BMI. The resulting −log₁₀ weighted p-values are displayed using a traditional Manhattan plot (Fig. 2). The red horizontal line in Fig. 2 represents the genome-wide significance threshold (p = 5 × 10⁻⁸) while the blue horizontal line represents a threshold of p = 5 × 10⁻⁵. Table 4 displays the 39 SNP associations achieving p < 5.0 × 10⁻⁵ ordered by genomic location. We chose p < 5.0 × 10⁻⁵ as a reasonable, albeit arbitrary, threshold for association signals that warrant potential follow-up, as other studies have done (e.g. see Carty et al. 2012). In the table, we report the SNP marker name, chromosome, base pair location, the nearest gene and where the SNP is located relative to that gene. We also report the allele conferring risk (increasing BMI units), the frequency of that allele and the other allele present in the data (reported on the forward strand). Finally, we report the results of the analysis including the linear mixed model coefficient (with corresponding 95 % confidence interval), the unweighted p-value, the weight applied to that SNP marker and the corresponding weighted p-value.
In this analysis, the linear mixed model coefficient may be interpreted as the additive effect of the risk allele on BMI in units of kg/m². For example, the C allele of SNP rs1421085 (FTO gene) is associated with a 0.27 kg/m² increase in BMI. Overall, the list of 39 SNPs includes some marker pairs from the same genic region that are likely in high linkage disequilibrium (LD). Additionally, the influence of the weighting scheme can clearly be seen in the table. Using weights in this way allows strongly significant markers to rise to the top even when down-weighted. There are six SNP markers that were down-weighted, yet still achieved a genome-wide suggestive level. However, the remaining 30 SNP markers were up-weighted. Therefore, this is largely a list of SNPs that have been pushed towards the top of the association signals because they have prior information indicating evidence of association with BMI (GIANT consortium) and achieved at least nominal significance in the Add Health sample. A notable signal is the highly replicable FTO gene region (chromosome 16), providing evidence that the Add Health sibling pairs sample is an informative genetic dataset for future use.
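The quantities behind the Q–Q plot and the two reference lines on the Manhattan plot can be sketched in Python. The function names are hypothetical, and the expected quantile (i − 0.5)/m for the i-th smallest of m p-values is one common plotting convention, assumed here rather than taken from the text. All p-values must be greater than 0.

```python
import math


def qq_points(pvalues):
    """(expected, observed) pairs on the -log10 scale for a Q-Q plot.

    Assumption: the i-th smallest of m p-values (i starting at 1) is paired
    with the uniform quantile (i - 0.5) / m.
    """
    m = len(pvalues)
    ordered = sorted(pvalues)
    return [(-math.log10((i + 0.5) / m), -math.log10(p))
            for i, p in enumerate(ordered)]


def tier(weighted_p, genome_wide=5e-8, suggestive=5e-5):
    """Label a weighted p-value against the two thresholds quoted in the text."""
    if weighted_p < genome_wide:
        return "genome-wide significant"
    if weighted_p < suggestive:
        return "suggestive"
    return "below threshold"


# Toy p-values, for illustration only.
for expected, observed in qq_points([0.5, 0.05, 0.005, 0.0005]):
    print(round(expected, 3), round(observed, 3))
print(tier(1e-9), "|", tier(1e-6), "|", tier(0.01))
```

Points lying close to the diagonal of the Q–Q plot indicate that the observed p-values behave as expected under the null for most markers.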

Fig. 1 Histogram of the weights derived from the GIANT consortium applied to the genome-wide association p-values

Fig. 2 Manhattan plot of the weighted GWAS association signals

Table 4 Top weighted genome-wide association signals (p < 5 × 10⁻⁵) ordered by genomic location

Discussion

The primary focus of this study was to introduce the Add Health sibling pairs subsample genome-wide association data and conduct initial analyses to demonstrate the scientific potential of the data as a resource to the Add Health community of researchers. Given the unconventional (among traditional genome-wide studies) Add Health pairs subsample, we adopted a relatively unconventional approach to carry out the genome-wide analysis. First, it is estimated that 96 % of all genome-wide studies have been conducted on people of European descent (Bustamante et al. 2011). The reasons for focusing so exclusively on samples of European descent range from convenience and efficiency (using existing cohort studies that focus on subjects of European descent) to minimizing sources of genetic heterogeneity (Pulit et al. 2010; Bustamante et al. 2011). Recently, however, a series of studies has empirically demonstrated a critical role for multiethnic studies in genome research of complex disease (Pulit et al. 2010; Masunuru et al. 2012; Carlson et al. 2013; Gong et al. 2013; Manichaikul et al. 2012; Manku et al. 2013; Marigorta et al. 2013; Sabater-Lleal et al. 2013). Often, these multiethnic studies conduct a genome-wide analysis within a relatively homogeneous European descent sample and, in parallel, a genome-wide analysis within a more genetically diverse sample such as African-Americans before combining the association signals using meta-analysis. An alternative approach, and the one chosen for the present study, conducts the genome-wide analysis on the entire sample across multiple ethnic backgrounds. This approach has been successfully applied in other studies of complex disease (for examples, see Kurreeman et al. 2012 and Xu et al. 2013).
However, rather than use the Add Health sample as a discovery sample (which often requires very large sample sizes), we adopted a weighting scheme based upon results from the GIANT consortium (Speliotes et al. 2010), which comprises a series of European descent samples. Therefore, the approach taken for the present study explores the extent to which the variants discovered in the European descent GIANT consortium may also be relevant to the multiethnic Add Health pairs subsample.

We note here that there are a multitude of valid and reasonable approaches that investigators may take when conducting a genetic study of a multiethnic, family-based sample with longitudinal measures of phenotype, behaviors and the environment. For example, Add Health researchers may be interested in imputing genotypes for purposes of combining association signals across multiple data sources genotyped on different platforms. Add Health researchers may also be interested in incorporating Add Health sampling weights and/or taking into account phenotypic clustering particularly when studying genetic risk factors within an environmental context. Furthermore, the family-based structure of the data would allow for more specific analyses using informative sibling pair family units through a variety of family-based association approaches (e.g. FBAT, Laird et al. 2000).

Through the weighted genome-wide association analysis, we observed association signals that align with previously reported and, in some cases, established genetic variants associated with BMI. For example, we were able to identify the FTO region that has been previously identified and replicated (Frayling et al. 2007). Additionally, we were able to identify variants that are up-weighted through the GIANT consortium results but do not achieve genome-wide significance in either GIANT or the Add Health sibling pairs subsample. These variants may be of particular interest for researchers who wish to explore G×E interactions in the Add Health sample to further explain the variability of the effect of these variants on BMI over time (age and development), across behaviors and under particular environmental contexts.

The Add Health study is unique because of the explicit emphasis on properly characterizing the multilevel and multidimensional aspects of adolescents’ lives as they transition to adulthood. This design in conjunction with the related and unrelated pairs data (see Harris et al. 2013) has expanded the scope of the gene-environment interaction perspective to a multilevel perspective in which environmental influences are measured at the level of the state (Boardman 2009), neighborhood (Cleveland 2003), and schools (Boardman et al. 2012). Most importantly, the research design enables the measurement of factors such as social norms (Boardman et al. 2008) that are otherwise difficult to assess. The assessment of these contextual factors has been highlighted as a critical area for future research in gene-environment interplay (Spittel et al. 2013) and the utilization of genome-wide data in conjunction with this social environmental backdrop may provide important insights in the etiology of complex morbidities such as obesity (Boone-Heinonen and Gordon-Larsen 2012).

As described elsewhere (Boardman et al. 2013a, b), the existing gene–environment interaction typology includes models in which genetic risk may be the most evident in the least risky, the most risky, or the typical environments. Depending on the anticipated G×E relationship and the specific phenotype, environments may either trigger or control genetic expression in a causal manner, or they may simply mask otherwise small genetic associations. Without a representation of the full range of environments, one may conclude that a specific polymorphism is either protective, risky, or not associated with a particular phenotype. Belsky and Pluess (2009) make a very strong case for the differential susceptibility hypothesis, which argues that environmentally sensitive loci will be protective in the most enriching environments but deleterious in unhealthy environments. This cross-over association cannot be identified without a representative sample from the full continuum of environments, which is, again, why the representativeness of the Add Health study is such an important resource in conjunction with the pairs data.

Finally, it is important to note that GWAS is but one use of genome-wide data. For instance, the genome-wide relationship models discussed above (Yang et al. 2010) can be extended to incorporate these multilevel design features so that the contextual variation in the heritability of health behaviors can be examined using measured genetic similarity rather than assumed similarity from sibling-based models. Similarly, sibling fixed effects approaches can take advantage of the “random assignment” of risk alleles to compare siblings residing and socializing in similar environments with those in very different social contexts (Fletcher and Lehrer 2011). These methods provide new possibilities for identifying causal models and have thus far not been extended to the genome-wide level.

Add Health sibling pairs subsample data access

The genome-wide data and phenotype measures used in this study will be made available to the scientific community through the NIH database of Genotypes and Phenotypes (dbGaP) by January 2015. Researchers interested in using the Add Health sibling pairs subsample genome-wide data will be required to access genotype data through the dbGaP authorized access system. Once genotype data are available through the dbGaP and access has been granted, researchers who request other phenotypic data not in dbGaP will be able to apply for a Genome-wide Data Restricted Access Agreement through Add Health beginning in 2015 (http://www.cpc.unc.edu/projects/addhealth). This process will allow approved investigators access to the entire Add Health sibling pairs subsample longitudinal data in addition to the genome-wide data.