1 Introduction

With the rapid advances in high-throughput resequencing and marker genotyping, high-density genetic variation information (such as single-nucleotide polymorphisms, SNPs, and copy-number variants, CNVs) has been collected and needs to be linked with function. Over the past few years, a multitude of genome-wide association studies (GWAS) and related strategies have identified numerous genetic variants associated with complex diseases or other traits in humans and plants, providing valuable insights into their genetic architecture. These findings are enriching our knowledge about the genetic basis of phenotypic variation and provide an opportunity for genetic testing. However, most variants identified so far account for only a small proportion of the causal genetic factors, leaving the remaining “missing” heritability to be explained [1]. Moreover, even with a complete understanding of the genetics of a complex phenotypic trait, it is still challenging to accurately predict phenotypic variation from individual genetic codes. Furthermore, the majority of these disease- or trait-related variants lie within noncoding regions of genomes, complicating their functional evaluation and posing the greatest challenge of the “post-GWAS” era [2].

Globally linking genetic variants to phenotypic diversity is one of the key goals of biology. Our understanding of such a genotype–phenotype map cannot be established without detailed phenotypic data [3]. However, our ability to characterise phenomes – the full set of phenotypes of an individual – lags far behind our ability to characterise genomes. Hence, phenomics – high-throughput and high-dimensional phenotyping – is emerging as a suite of new technologies to accelerate progress in our understanding of the relationship between genotype and phenotype [3, 4].

In this chapter, we will first review the principle of dissecting genotypes and monitoring phenotypes, usually in high-throughput manners. We also highlight current approaches to obtaining phenomic data and the emerging applications of large-scale phenotyping approaches in the phenomics era. We then outline the current strategies, such as GWAS and analogous methodologies, for globally linking genetic variation to phenotypic diversity. We summarise insights about the complete “genotype–phenotype” map that could be established through integrating “omics” data at broad levels in terms of a systems-biology approach. Related phenome projects and phenomic tools are discussed. Please keep in mind that the results discussed here are mostly based on research in humans and/or plants and that only a subset of published information can be mentioned.

2 Defining the Genotype and Phenotype

In this section, we outline the state-of-the-art methods used for the assessment of genotypes and phenotypes and the corresponding mapping approaches for linking genotypes to phenotypes at global levels (Table 11.1). We also present phenomics-related projects that combine rich genomic data with data on quantitative variation in phenotypes and which have recently been launched in both humans and plants (Table 11.2). We highlight many emerging technologies developed for high-throughput phenotyping in plants (Table 11.3).

Table 11.1 Various approaches towards genotype–phenotype map
Table 11.2 Related projects or resources for phenomics studies
Table 11.3 Automated or semiautomated plant phenotyping platforms

2.1 Genetic Variation: Genotyping

Genotyping technology, also known as genotypic assaying, refers to the set of methodologies and protocols used to elucidate the genetic makeup (genotype) of an individual. Genotyping is essential in deciphering the genetic causes of complex phenomena, including health, disease, crop yields and evolutionary fitness. Human genetic mapping was initially performed based on restriction fragment length polymorphisms (RFLPs) [5, 6], amplified fragment length polymorphisms (AFLPs) [7] and microsatellite markers (also known as short tandem repeats or simple sequence repeats) [8]. More recently, SNPs, due to their high abundance, low mutation rates and amenability to high-throughput analysis, have become the markers of choice for linkage and linkage disequilibrium (LD) mapping [9, 10]. The usually binary SNP markers are well suited to automated, high-throughput typing. Indeed, it is now feasible to genotype SNPs with high density at the genome-wide scale by utilising array-based [11, 12] or sequencing-based [13, 14] technologies (Table 11.1). Although high-throughput SNP arrays avoid time-consuming cloning and primer design steps, they lack a discovery step and are therefore biased when used to genotype new populations. Now, with the advent of next-generation sequencing (NGS), new technologies such as reduced-representation libraries (RRLs) [15], complexity reduction of polymorphic sequences (CRoPS) [16], restriction-site-associated DNA sequencing (RAD-seq) [17] and low-coverage genotyping, including multiplexed shotgun genotyping (MSG) [18] and genotyping by sequencing (GBS) [19], are capable of genome-wide marker discovery for both model organisms and non-model species. Although sequence-level variants have been catalogued more extensively, structural variations – including indels (insertions/deletions), CNVs and inversions – are now investigated for their contribution to complex traits, including many important common diseases [20].
CNVs can be identified with various genome analysis platforms, including array-based comparative genomic hybridisation (CGH), SNP genotyping platforms and NGS.

Our knowledge regarding human genetic variations is mostly derived from the international effort of the SNP Consortium [21] and the International HapMap Project [22] (Table 11.2). Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples. Projects such as the Personal Genome Project (PGP) (e.g. diploid personal genomes [23]), the 1000 Genomes Project (TGP) [24] and exome sequencing projects [25] are under way in an attempt to elucidate the full spectrum of human genetic variations as a foundation to investigate the relationship between genotype and phenotype. For example, the Phase 1 publication of TGP in 2012 included whole-genome sequences of 1,092 individuals from 14 populations. A total of 38 million SNPs, 1.4 million short indels and more than 14,000 larger deletions were identified [26]. Notably, the genome of any apparently healthy individual carries more than 2,500 nonsynonymous variants at conserved regions, 20–40 variants identified as damaging at conserved sites and ~150 loss-of-function (LoF) variants in protein-coding genes, some of which are known to cause Mendelian disease [26].

Meanwhile, genome-wide genotyping has been performed extensively in plants in recent years (Table 11.2), for example in Arabidopsis thaliana [27], rice [28], maize [29, 30], sorghum [31] and barley [32]. These rich resources will ultimately help to explore the genetic basis of agriculturally relevant plant traits, such as flowering time, growth rate, yield and stress tolerance, and to improve crops and understand plant adaptation.

2.2 Phenomics: Multilevel and Multidimensional Assessment of Features

The term phenotype includes the composite of an organism’s observable traits or characteristics – such as its morphological, developmental, physiological, pathological or biochemical properties, phenology and behaviour – that can be monitored, quantified and/or visualised by some technical procedure. Phenomics is defined as the study of all the phenotypes of an organism (the phenome), which result from the genetic code (G), environmental factors (E) and their interactions (G × E). In contrast to genotypes, which are essentially one-dimensional because they are determined by the linear DNA code, phenotypes are usually multidimensional and often vary across space and time. An important field of research today is trying to improve, both qualitatively and quantitatively, the capacity to measure phenomes. Broadly defined, the phenome includes epigenomic, transcriptomic, proteomic, metabolomic and many other “omics” data that quantify biochemical and cellular processes. We have relatively well-developed technologies for measuring, in vivo or destructively, physiological states and other “internal phenotypes” (endophenotypes), such as gene expression, protein and metabolite levels, whereas our ability to measure “external phenotypes” (exophenotypes) is rapidly evolving.

We will never be able to come even close to a complete characterisation of the phenome due to its highly dynamic and high-dimensional properties. However, increasing the quantitative information obtained by phenotypic measurements is an important goal for phenomics [3]. Phenotypic variation, a fundamental prerequisite and the perpetual force for evolution by natural selection, results from the complex interactions between genotype and environment (G × E). Phenome-wide data are essential for enabling us to trace causal links in the genotype–phenotype map (G-P map [33]) as they define the space of all possible phenotypes (P space; Fig. 11.1).

Fig. 11.1

The genotype–phenotype map (G-P map). The left panel shows the relationship of the genotype space (G space) and the phenotype space (P space) [3]. The information transmitted from G space to P space is shown in the right panel. Genotypes can acquire mutations and undergo recombination over generations. Phenotypes can be broadly classified into internal and external phenotypes. Internal phenotypes include properties at the molecular, cellular or tissue level, which in turn shape external phenotypes such as morphology and behaviour. In response to environmental stimuli, the epigenetic process creates the phenotypes using genotype information. External phenotypes can in turn shape the environment that an individual occupies, creating complex feedback relationships between genes, environments and phenotypes. Natural selection acts in the P space to change the average phenotype of parents away from the average phenotype of the generation. The importance of the environment suggests that we should explicitly broaden the G-P map to the genotype–environment–phenotype (G-E-P) map. g: genotype; p: phenotype; ip: internal phenotype

High-throughput automated imaging is the ideal tool for phenomic studies. Owing to the recent increased availability of high-precision robotic handling machinery, many imaging-based technologies that span molecular to organismal spatial scales have been or are being established and enable us to extract multiparametric phenotypic information in great detail. Detectors covering a broad range of the electromagnetic spectrum, as well as magnetic resonance imaging (MRI), are widely used for phenotyping at different scales of resolution [34]. High-dimensional spatiotemporal data on many phenotype classes such as morphology, behaviour, physiological state and locations of proteins and metabolites can be captured by these imaging techniques and analysed via high-performance computing [3]. In recent years, systems for performing high-content microscopy-based assays have become available and are often used to investigate the effects of chemical (such as drugs and small molecules) and genetic (loss of function of genes using RNA interference [RNAi]) perturbations on cultured cells [35–42]. Such genome-wide RNAi screens enable us to discover novel gene functions and interrogate their functional relationships based on phenotypic similarity analysis [43, 44]. These screens produce huge amounts of high-content image data that can be automatically processed using software tools such as ImageJ [45], EBImage [46], CellProfiler [47] or PhenoRipper [48]. Traditional microscopy is generally used for two-dimensional (2D) imaging. However, high-resolution and dynamic three-dimensional (3D) imaging data can be acquired by confocal laser scanning microscopy (CLSM), X-ray computerised tomography (CT) or MRI.

In plants, the “phenotyping bottleneck” [4] needs to be addressed by high-throughput noninvasive technologies [49]. Thanks to newly developed imaging sensors (e.g. high-resolution imaging spectrometers) and advanced software for image analysis and feature extraction, a range of automated or semiautomated high-throughput plant phenotyping systems (Table 11.3) have recently been developed and applied to assess plant function and performance under controlled conditions [50–58]. One of the pioneer platforms, PHENOPSIS [51], was developed for the dissection of genotype × environment effects on different processes in Arabidopsis thaliana with reproducible phenotyping. TraitMill [50, 52], GROWSCREEN [53, 55, 59], LIMINA [54], HYPOTrace [56], HTPheno [57] and LeafAnalyser [58] provide general image-processing solutions for plant morphological measurements (such as plant height, length and width, shape, projected area and biovolume) and colorimetric analysis. Most recently, high-throughput phenotyping has been used for three-dimensional plant analysis [60–64], focusing on specific organs (e.g. leaves, roots and aerial parts). However, most of these tools have the inherent disadvantage that they are designed to address only very specific questions [65]. Among the advancing solutions, the state-of-the-art phenotyping platform developed by LemnaTec (http://www.lemnatec.com/) is a robotic greenhouse system that uses non-destructive imaging to monitor plant growth under controlled environmental conditions (such as nutrition, water availability, irradiation and temperature) over a period of time. Several imaging modalities, such as visible/colour/RGB (red, green and blue) imaging, fluorescence, thermal and near-infrared imaging, have been adopted in this system to assess the physical and physiological status of plants, such as their geometric properties, pigment or fluorophore contents, canopy temperature and tissue water content.
LemnaTec systems have now been deployed in growth chambers or greenhouses (e.g. at the Leibniz Institute of Plant Genetics and Crop Plant Research [IPK; Germany], the Australian Centre for Plant Functional Genomics [ACPFG] at the University of Adelaide [Australia], Aberystwyth University [UK] and the PhenoArch at Institut National de la Recherche Agronomique at Montpellier [France]) for high-throughput phenotyping in Arabidopsis [66], wheat [67], barley [57] and maize (unpublished data). The time-lapse phenotypic data from these large-scale phenotyping platforms provide an invaluable opportunity to model and predict plant growth [67, 68]. Also, these data can be used to map quantitative trait loci (QTL) for growth-related traits. Notably, a recent phenotyping application was developed for QTL mapping in pepper plants using phenotypic features such as leaf angle and leaf size from RGB images, resulting in heritabilities of 0.56 and 0.70, respectively [69]. At the same time, however, the huge amounts of imaging data generated from these platforms present a great challenge for data analysis. As one solution, the Integrated Analysis Platform (IAP; http://iap.ipk-gatersleben.de) [70] is being developed as a comprehensive framework for high-throughput phenotyping in plants, which enables us to extract a high-dimensional list of plant features from real-time images to quantify plant growth and performance.
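
To give a feel for the kind of feature extraction such platforms perform, the following minimal sketch segments plant pixels in a tiny RGB image and derives three morphological traits (projected area, height and width). It is an illustration only and does not reproduce the pipeline of any tool cited above: the excess-green index, the threshold of 50, the function names and the toy image are all assumptions made for this example.

```python
def excess_green(pixel):
    """Excess-green index 2G - R - B for one (R, G, B) pixel."""
    r, g, b = pixel
    return 2 * g - r - b


def segment_plant(image, threshold=50):
    """Return a binary mask (list of rows) marking pixels classified as plant.

    The threshold is an arbitrary value chosen for the toy image below.
    """
    return [[excess_green(px) > threshold for px in row] for row in image]


def shape_traits(mask):
    """Projected area (pixel count) plus bounding-box height and width."""
    coords = [(y, x) for y, row in enumerate(mask) for x, on in enumerate(row) if on]
    if not coords:
        return {"area": 0, "height": 0, "width": 0}
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return {
        "area": len(coords),
        "height": max(ys) - min(ys) + 1,
        "width": max(xs) - min(xs) + 1,
    }


# Hypothetical 4 x 4 top-view image: brownish soil with a small green plant.
SOIL = (120, 90, 60)
LEAF = (40, 160, 50)
toy_image = [
    [SOIL, SOIL, LEAF, SOIL],
    [SOIL, LEAF, LEAF, LEAF],
    [SOIL, SOIL, LEAF, SOIL],
    [SOIL, SOIL, SOIL, SOIL],
]

traits = shape_traits(segment_plant(toy_image))
print(traits)
```

In practice, pixel counts would be converted to physical units using a camera calibration, and the same traits would be tracked over successive images to obtain growth curves.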

2.3 Defining Genotype–Phenotype Relationships

Understanding the interplay between genotype and phenotype (G-P map; Fig. 11.1) is the ultimate goal of both genomics and phenomics research. It will yield insights that are important for predicting disease risk and tailoring individual therapeutic treatments in human populations, for speeding up selective breeding for traits in agriculturally important crops and for predicting adaptive evolution [71]. The interactions between genotypes and phenotypes also inevitably involve environmental factors [3]. Thus, the interaction between genotype and phenotype has often been conceptualised by the following relationship: genotype (G) + environment (E) + genotype × environment (G × E) → phenotype (P). Since individuals themselves may influence the environment and exert different effects depending on their characteristics, feedback of phenotypes needs to be considered in this concept. Furthermore, the response of a certain genotype to an environmental factor may depend strongly on the phenotypic status of the individual, which is the result of events that occurred in its preceding life history. Understanding the G-P map will provide a framework for the development of personalised medicine and crop breeding [72, 73].
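
The conceptual relationship G + E + G × E → P can be made concrete with a simple additive decomposition of a genotype-by-environment table of mean phenotypes. The sketch below assumes the standard two-way model P = mean + G + E + G × E, which is one common formalisation and not necessarily the one intended by the works cited here; it ignores the phenotypic feedback discussed above. The function name and the plant heights are hypothetical.

```python
def decompose(table):
    """Split a genotype-by-environment table of mean phenotypes into the
    grand mean, genotype effects (G), environment effects (E) and
    interaction effects (G x E).

    table[i][j] is the mean phenotype of genotype i in environment j.
    """
    n_g, n_e = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_g * n_e)
    g_eff = [sum(row) / n_e - grand for row in table]
    e_eff = [sum(table[i][j] for i in range(n_g)) / n_g - grand for j in range(n_e)]
    gxe = [
        [table[i][j] - grand - g_eff[i] - e_eff[j] for j in range(n_e)]
        for i in range(n_g)
    ]
    return grand, g_eff, e_eff, gxe


# Hypothetical plant heights (cm): rows are two genotypes,
# columns are two environments (e.g. dry and well watered).
heights = [
    [10.0, 14.0],
    [12.0, 24.0],
]

grand, g_eff, e_eff, gxe = decompose(heights)
print("grand mean:", grand)
print("G effects:", g_eff)
print("E effects:", e_eff)
print("G x E effects:", gxe)
```

Non-zero interaction terms indicate that the genotypes respond differently to the change of environment; here the second genotype gains more height under the second environment than the first genotype does.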

Genomics and other highly parallel technologies – including epigenomics, transcriptomics, proteomics, metabolomics and ionomics – have become the mainstay of biological research. These recently developed technologies, commonly termed “omics”, permit assessment of the entirety of the components of biological systems at broad levels (Table 11.1). Furthermore, the emerging high-throughput phenotyping technology is moving towards comprehensive, quantitative high-dimensional measurements of individuals (phenome). However, our current knowledge of the genetic basis of complex phenotypic traits probably represents only the tip of the iceberg. Why do even genetically identical twins often substantially differ in phenotypic traits such as disease risk and drug response? Indeed, it is now understood that the differences are to a large extent the result of the epigenome and involve chromatin modifications as well as myriad noncoding RNAs (ncRNAs) [74, 75]. The emerging task is to understand the complex relationships among the genome, the epigenome, the environment and the phenome. The goal of globally linking genotype to phenotype can only be achieved by integrating information from different levels into a systems-biology model, which makes prediction of phenotypes possible (Fig. 11.2). This model should also consider the complex environmental factors in the real world, which need to be very precisely defined. For example, it is now possible to model rice transcriptome dynamics under fluctuating field conditions [76], raising hopes of predicting genome-wide transcriptional responses in complex real-world settings [77].

Fig. 11.2

Flow chart of the assessment of gene function using quantitative trait locus (QTL) analyses. Genetic markers (DNA level) such as SNPs and CNVs can be genotyped using next-generation sequencing technology. Quantitative traits, such as DNA methylation level, transcript, protein or metabolite content and biomass, can be analysed using different detection methods. The information flow is indicated with arrows. Environmental factors are also included. The data generated can be used for mapping to determine the genomic regions (QTLs) responsible for the observed variation. The identification of the causal genes underlying the QTL, and ultimately their functional characterisation, will be facilitated by the combined analysis of the data generated using different profiling techniques and additional information obtained using bioinformatics tools [78]. methQTLs: DNA methylation QTLs; eQTLs: expression QTLs; pQTLs: protein QTLs; mQTLs: metabolic QTLs; phQTLs: phenotypic QTLs; GWAS: genome-wide association studies; EWAS: epigenome-wide association studies; MWAS: metabolome-wide association studies

3 Approaches for Linking the Genome to the Phenome

3.1 QTL Detection Through Linkage and Association Mapping: Identifying the Genetic Basis of Complex Traits

Thanks to advanced high-throughput experimental technologies such as microarrays and sequencing, high-density genotyping arrays are now available and widely used to establish large-scale genome-wide maps of QTLs for various phenotypes such as human diseases and agricultural traits [20, 79–81]. Genome-wide association studies (GWAS, also called association mapping) are becoming the preferred method to relate genetic variation to phenotypic diversity in populations of unrelated individuals. The most common polymorphic markers used for GWAS are sequence polymorphisms such as SNPs and structural variants such as indels and CNVs [20]. GWAS are now preferred over traditional family-based linkage studies (linkage-based QTL mapping; Fig. 11.3a) [82], which use interval mapping to estimate the map position and effect of each QTL.

Fig. 11.3

Principle of quantitative trait locus (QTL) mapping. (a) Linkage-based mapping versus association mapping. The purpose of QTL mapping is to uncover the genetic basis of quantitative traits of interest. Linkage-based analyses seek to identify segregating genetic markers (M1, M2, M3 and M4) that predict the organismal phenotype, using a population that carries genetic mosaics derived from parental varieties, such as second generation (F2) plants or recombinant inbred lines (RILs). The relationships of individuals are known (P1 → F1 → F2 → RILs). It should be noted that RILs, rather than the F2 or F3 population, are needed to evaluate genotype-by-environment interactions. The region highlighted in yellow indicates the position of a causal locus or QTL. Association mapping (analogous to genome-wide association study [GWAS]) relies on correlations between genetic markers and a phenotype among collections of diverse germplasm. Thus, the recombination used in this strategy is historical. As shown in the figure, the association mapping population is separated by many generations from its progenitors. In linkage-based studies, the haplotype blocks in the mapping population may be large and, as a consequence, the causal locus might only be mapped to a large region. The haplotype blocks in an association mapping population tend to be much smaller, so it might be possible to localise the causal locus to a small genomic region. Within the QTL region, relevant genes may be identified for future studies or candidates may be suggested for targeted sequencing or experimental perturbation. (b) Concept of intermediate phenotypes used in QTL mapping. The association of genetic variants is strongest with their closest intermediate phenotypes (IPs), such as variation in DNA methylation (methQTLs), transcript (eQTLs) or protein (pQTLs) content and metabolic traits (mQTLs).
In some cases, the association of genetic variants with the organismal end point may not even be detectable at a level of genome-wide significance. (c) Relationships of GWAS and QTL mapping methodologies in integrative analyses (Part a is reproduced, with permission, from Mackay et al. [98], Copyright 2009, Macmillan Publishers Ltd. Part b is adapted from Suhre and Gieger [136]. Part c is reproduced from Cookson et al. [123])

GWAS use dense maps of genetic markers that cover the whole genome to look for allele-frequency differences between cases (e.g. patients with a specific disease or individuals with a certain trait) and controls. Several powerful statistical methods have been established to associate common complex traits with genomic variation, including efficient mixed-model association (EMMA) [83], EMMA expedited (EMMAX) [84], genome-wide EMMA (GEMMA) [85], genome-wide rapid association using mixed model and regression (GRAMMAR) [86], fast linear mixed models (FaST-LMM) [87], the general linear model and mixed linear model implemented in TASSEL (Trait Analysis by aSSociation, Evolution and Linkage) [88] and the EIGENSTRAT method [89]. In the past few years, intensive efforts in more than 1,500 GWAS have uncovered hundreds of genetic variants associated with hundreds of diseases and other traits [90], providing valuable insights into the complexities of the genetic architecture of human diseases. Although disease-associated variants in protein-coding regions might be expected to matter most for trait/disease diversity, the vast majority (80 %) of variants fall outside coding regions, highlighting the importance of noncoding regions in the search for disease-associated variants [1, 90]. However, the loci identified thus far explain only a small fraction of the phenotypic diversity in humans, raising questions regarding “the missing heritability” [1, 91]. An informative example is human height, which is 80–90 % heritable, yet the loci detected in GWAS together account for less than 5 % of the heritability of height [92]. Several explanations for this missing heritability have been proposed, including rare variants, allelic heterogeneity, epigenetic variation (see the next section), CNVs, gene–gene interactions and, perhaps most importantly, environmental uncertainty [1, 91].
Intriguingly, GWAS have proved to be even more successful in plants than in humans [93], the key observation being that initial GWAS in plants (e.g. in Arabidopsis [94], maize [95, 96] and rice [28]) have explained a much greater proportion of the phenotypic variation. Perhaps the best example is a study in rice [28], in which the authors performed low-coverage resequencing of the genomes of a panel of about 500 rice landraces and identified 80 loci associated with 14 agronomic traits, explaining on average ~36 % of the phenotypic variance. Several of these loci matched previously characterised genes. The ongoing development of technologies, both in genotyping (for the detection of CNVs and other structural variants) and in statistical methods for accurate association testing, will help us to examine potential sources of missing heritability and to better illuminate the causality of complex traits/diseases.
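
The basic case–control comparison of allele frequencies described above can be sketched as a chi-square test on a 2 × 2 table of allele counts per marker, followed by a multiple-testing correction. This is a deliberately simplified illustration: it is not one of the mixed-model methods listed above and makes no correction for population structure or relatedness. The marker names, allele counts and the use of a Bonferroni threshold are assumptions for this example.

```python
import math


def allelic_chi2(case_counts, control_counts):
    """Chi-square test (1 degree of freedom) on a 2 x 2 table of allele counts.

    Each argument is a pair (count of allele A, count of allele a).
    Returns the test statistic and its p-value.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a row or column is empty: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts: marker -> ((A, a) in cases, (A, a) in controls).
toy_counts = {
    "snp1": ((60, 40), (40, 60)),
    "snp2": ((50, 50), (50, 50)),
}

alpha = 0.05 / len(toy_counts)  # Bonferroni correction for the number of markers
significant = [
    name
    for name, (cases, controls) in toy_counts.items()
    if allelic_chi2(cases, controls)[1] < alpha
]
print(significant)
```

With hundreds of thousands of markers the same correction leads to the very stringent genome-wide significance thresholds that are typical of GWAS.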

Linkage-based QTL mapping approaches have proved to be enormously successful for plant breeding and have identified large-effect loci for complex traits, which include most agriculturally important traits [81, 97]. The primary advantage of QTL mapping in plants is the ease of creating populations of segregating individuals that show measurable phenotypic variation. However, the generation of crosses is time-consuming, and it is necessary to focus on traits that can be readily and accurately phenotyped. Furthermore, due to the low number of recombination events represented in biparental mapping populations, causal loci (QTLs) identified by linkage-based strategies can only be mapped to large chromosomal regions, and tedious fine mapping needs to be carried out to narrow these down to candidate genes that can be subjected to targeted sequencing or experimental perturbation [97, 98].

The emergence of a next generation of mapping populations [97] overcomes many of the limitations of biparental QTL mapping and association mapping. Such experimental designs combine association and linkage analysis, as they involve crossing multiple parents and advancing populations through several generations to increase allelic richness and to improve resolution in genetic mapping. Such designs include the nested association mapping (NAM) [95, 99, 100], the multiparent advanced generation intercross (MAGIC) [101, 102] and the recombinant inbred advanced intercross line (RIAIL) [103, 104] populations.

It should also be mentioned that genomic selection (GS) [105], a genomics-based strategy for predicting phenotypes from genome-wide marker data, is receiving considerable attention among (animal and) plant breeders. Similar to linkage and association mapping methods, GS starts with the development of a prediction model on a training population of individuals characterised for both genotype and phenotype. Unlike linkage and association mapping approaches, GS models consider all markers as predictors and can thus capture more of the variation due to small-effect QTLs. Most importantly, the training population used in GS is generally closely related to the breeding population under selection, which supports the use of GS models to obtain accurate predictions for breeding [106].
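
The idea of using all markers simultaneously as predictors can be sketched with ridge regression, which shrinks every marker effect towards zero instead of testing markers one at a time. This is only one of several model families used for GS and is shown here as an assumption, not as the method of the cited works. The penalty value, the function names and the noise-free toy data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial
    pivoting. Adequate for the small, positive-definite systems used here."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def ridge_marker_effects(genotypes, phenotypes, lam=1.0):
    """Estimate one effect per marker from centred data by solving
    (X'X + lam * I) beta = X'y. Requires lam > 0."""
    n, p = len(genotypes), len(genotypes[0])
    col_means = [sum(row[j] for row in genotypes) / n for j in range(p)]
    y_mean = sum(phenotypes) / n
    x = [[row[j] - col_means[j] for j in range(p)] for row in genotypes]
    y = [v - y_mean for v in phenotypes]
    xtx = [
        [sum(x[i][j] * x[i][k] for i in range(n)) + (lam if j == k else 0.0)
         for k in range(p)]
        for j in range(p)
    ]
    xty = [sum(x[i][j] * y[i] for i in range(n)) for j in range(p)]
    return solve(xtx, xty), col_means, y_mean


def predict(genotype, beta, col_means, y_mean):
    """Predicted phenotype (genomic estimated value) of one genotyped candidate."""
    return y_mean + sum(b * (g - m) for b, g, m in zip(beta, genotype, col_means))


# Hypothetical training population: rows are individuals, columns are
# markers coded as allele dosage (0, 1 or 2).
train_genotypes = [
    [0, 0, 0],
    [2, 0, 0],
    [0, 2, 0],
    [0, 0, 2],
    [2, 2, 0],
    [1, 1, 1],
]
# Toy phenotypes generated without noise as 10 + 2*m1 - 1*m2 + 0.5*m3.
train_phenotypes = [10.0, 14.0, 8.0, 11.0, 12.0, 11.5]

model = ridge_marker_effects(train_genotypes, train_phenotypes, lam=1.0)
candidates = {"line_A": [2, 0, 2], "line_B": [0, 2, 0]}
gebv = {name: predict(g, *model) for name, g in candidates.items()}
best = max(gebv, key=gebv.get)
print(best, gebv)
```

In a breeding programme the candidates would be genotyped but not phenotyped, and selection would be based on the predicted values alone; the penalty would normally be chosen by cross-validation or derived from variance components.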

3.2 EWAS: Linking Epigenetic Variation and Complex Traits

In addition to genetic variability, epigenetic factors including DNA methylation, histone modifications and ncRNAs (e.g. small interfering RNAs [siRNAs], microRNAs [miRNAs] and large intergenic ncRNAs [lincRNAs]) are considered the missing part of the underlying molecular control of phenotypic variation (Table 11.1) [71, 75]. DNA methylation is the most studied epigenetic modification, and its variation at a single CpG (cytosine–guanine dinucleotide) site (known as a methylation variable position, MVP), in CHG (H = A, T or C) or CHH contexts or across a differentially methylated region (DMR) can be considered the epigenetic equivalent (heritable epigenetic polymorphism) of an SNP in the genome [107]. While the DNA-centric model (e.g. GWAS) has allowed scientists to uncover the molecular genetic origins of Mendelian traits and diseases successfully, many complex traits and diseases are non-Mendelian, making them hard to explain. Due to the elasticity and plasticity of epigenetic factors, epigenetics can provide a novel framework for the identification of aetiological factors in complex traits and diseases [108]. The direct evidence that epigenetics could “make the difference” comes from the remarkably different epigenetic profiles, including disease-associated epigenetic differences, in human monozygotic (MZ) twins, who share an identical genotype [109–111]. Indeed, with the recent advances in genomic technologies, the large-scale, systematic epigenomic equivalents of GWAS, termed epigenome-wide association studies (EWAS), are emerging as a promising tool to investigate human disease-associated epigenetic variation [71]. However, it is still challenging in EWAS to determine whether epigenetic variation is the cause or the functional consequence of the identified effects.
In this regard, the sample used in an EWAS should ideally consist of MZ twins, to eliminate the influence of genetic background on the identified epigenetic variation [71], as recently demonstrated by several studies [112–115]. Analysis of epigenetic variation is likely to be most successful when integrated with the analysis of genetic variants (i.e. QTL mapping), leading to the identification of the underlying genetic variants that influence epigenetic state (epigenotype). The loci that harbour genetic variants corresponding to methylation states (e.g. MVPs or DMRs) have thus been termed methylation QTLs (methQTLs) [116]. The most pronounced methQTLs influence epigenetic states in cis, and they reside less than 50 bp from the CpG site in question [112]. The notion of methQTLs provides a general idea for integrated GWAS and EWAS (Fig. 11.3) to explore genotypes that exert their function through epigenetic mechanisms, which can be maintained and propagated during cell division, resulting in permanent maintenance of the acquired phenotype [71, 108, 117].

At the same time, there is also evidence from plant research communities that naturally occurring epigenetic changes (i.e. DMRs) in a single gene locus (epiallele) can lead to heritable phenotypic variation [118–122]. The epialleles often show increased cytosine methylation of the promoter and can result in nearby gene expression changes that are sometimes transmitted across generations, thus contributing to heritable phenotypic variation independent of DNA sequence diversity. These findings will advance our understanding of the relative roles of genetic and epigenetic variation in controlling quantitative trait variation in plants.

3.3 Variation in Gene Expression: From eQTLs to Phenotypes

Variation in gene expression is an important mechanism underlying phenotypic variation such as disease susceptibility and drug response. DNA variants may alter transcript abundance and splicing patterns through modification of regulatory elements [123]. Genomic loci responsible for this genetic control are consequently termed expression QTLs (eQTLs). The combination of high-throughput phenotyping and transcriptional profiling has allowed the systematic identification of eQTLs (Fig. 11.3) [98]. In principle, eQTL mapping uses transcript abundance as a phenotypic trait and maps the genomic loci controlling the transcript level, in the same manner as traditional QTL mapping of any other quantitative trait [124]. According to the genomic context of transcripts, eQTLs can be categorised into cis eQTLs, if the molecular variants (e.g. SNPs) map close to (within 100 kb upstream or downstream of [112, 125]) the gene from which the transcript originates, and trans eQTLs in all other cases. Further statistical analysis revealed a strong enrichment of cis eQTLs around transcription start sites (TSSs) and within 250 bp upstream of transcription end sites (TESs) [126]. The cis-acting variants are more likely to lie in exonic regions than in intronic regions. Given that genetic variation in the 3′UTR of a gene may create or destroy a miRNA binding site [127], the cis effects are likely mediated through miRNA-regulated pathways. Besides this, cis-acting variants in promoter or enhancer regions may influence the binding of transcription factors and thus promoter regulation. Nevertheless, it is still not known whether trans effects are mediated through transcription factor variants or through other mechanisms [123]. Generally, cis eQTLs tend to have a stronger influence on target gene regulation than trans eQTLs. Moreover, there exist so-called eQTL hot spots, at which variation is associated with the expression levels of many transcripts.
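
The two ideas in this paragraph, treating transcript abundance as a quantitative trait and labelling the resulting eQTLs as cis or trans, can be sketched as follows. The 100 kb window is the value quoted above; measuring it from the annotated gene start and end, the function names, the coordinates and the expression values are assumptions made for illustration.

```python
CIS_WINDOW_BP = 100_000  # window quoted in the text; other studies use other sizes


def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                  window=CIS_WINDOW_BP):
    """Label an eQTL as 'cis' if the SNP lies within `window` bp upstream or
    downstream of the associated gene, and as 'trans' otherwise."""
    if snp_chrom != gene_chrom:
        return "trans"
    if gene_start - window <= snp_pos <= gene_end + window:
        return "cis"
    return "trans"


def dosage_slope(dosages, expression):
    """Least-squares slope of transcript abundance on allele dosage (0/1/2),
    i.e. the estimated change in expression per copy of the allele."""
    n = len(dosages)
    mean_d = sum(dosages) / n
    mean_e = sum(expression) / n
    cov = sum((d - mean_d) * (e - mean_e) for d, e in zip(dosages, expression))
    var = sum((d - mean_d) ** 2 for d in dosages)
    return cov / var


# Hypothetical gene on chr1 spanning positions 200,000 to 210,000.
print(classify_eqtl("chr1", 150_000, "chr1", 200_000, 210_000))  # inside the window
print(classify_eqtl("chr1", 50_000, "chr1", 200_000, 210_000))   # too far upstream
print(classify_eqtl("chr2", 205_000, "chr1", 200_000, 210_000))  # other chromosome

# Hypothetical expression values for six individuals with known SNP dosage.
print(dosage_slope([0, 1, 2, 0, 1, 2], [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]))
```

A full eQTL scan would repeat the regression for every SNP and transcript pair and control for multiple testing, which is why cis-only scans, with far fewer tests, are often run separately.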

The resulting comprehensive eQTL maps provide potential insight into the biological basis of complex quantitative trait associations identified through GWAS [123]. Since the expression of transcripts is subject to intensive gene regulation, eQTL data should be interpreted further by incorporating additional biological information, such as results from GWAS and EWAS as discussed above, and analysis of regulatory networks, which are discussed below. This kind of integrated analysis has been utilised in several studies [112, 114, 115, 128, 129].

Proteins are mainly responsible for the biological phenotype; compared with genetic, epigenetic or transcript variants, they should thus more accurately reflect the cellular physiological state or the changes induced by disease processes, drug treatment or other influences. Various mechanisms of post-transcriptional regulation can lead to changes in protein abundance in the absence of a corresponding alteration of transcript levels, suggesting that the proteome can provide important biological insights and disease biomarkers that cannot be captured through evaluation of the transcriptome alone [130]. Association mapping can also be performed at the protein level through protein QTL (pQTL or PQL [131]) mapping, in which protein abundance or modification is treated as a phenotypic trait. pQTL mapping, complementary to eQTL mapping, is now becoming feasible with technical advances in mass spectrometry (MS)-based proteomics [130, 132, 133]. The limited overlap between pQTLs and eQTLs from the same study [134] indicates that the proteome and the transcriptome give distinct insights into the diversity between individuals and further highlights the value of systems-biology approaches that combine such high-throughput data in an integrated analysis.

3.4 Genome-Wide Association Studies with Metabolomics: Metabolic QTL Analysis

In addition to genomics, epigenomics, transcriptomics and proteomics, metabolomics is emerging as a complementary approach for globally measuring, ideally, all endogenous small organic molecules (metabolic traits; normally below 1,500 Da) in a biological sample. However, unlike the transcriptome and, to a lesser degree, the proteome, the metabolome is much more amenable to variation and much more diverse in terms of chemical structure and function [135]. Metabolite profiles capture important information on the environment (diet, lifestyle, gut microbial activity and bacterial activity) that individuals experience and can give an instantaneous snapshot of an individual’s physiological state at a particular time under a particular set of conditions. Some changes in metabolite levels may themselves be a consequence of phenotypic diversity; a metabolic trait may therefore represent either a functional intermediate trait or merely a correlated biomarker [136]. Noninvasive metabolic methodologies include nuclear magnetic resonance (NMR) spectroscopy [137], MS and high-performance liquid chromatography (HPLC). Owing to advances in these technologies, quantitative readouts for hundreds of small molecules can now be obtained on a large scale. One aspect of experimental design is the choice of which metabolites to study. While targeted methods provide precise measurements of specific (known) metabolites and are easy to replicate, nontargeted approaches are currently more promising as they provide the opportunity to discover novel associations, including those involving hitherto uncharacterised metabolites [136].

In the past few years, GWAS have faced the challenge that the effect sizes of genetic associations are generally small and that information on the underlying biological processes is lacking [136]. These problems can be overcome, at least partially, by association with metabolic traits as functional intermediates [138]. There is increasing interest from the scientific community, and particularly from plant biologists, in integrating metabolic approaches into research with the aim of unravelling phenotypic diversity and its underlying genetic variation [78]. The combination of high-throughput metabolic phenotyping with general QTL analysis has thus given birth to the emerging field of metabolome-wide association studies (MWAS; Fig. 11.3).
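Treating a metabolite level as a quantitative trait can be illustrated with a single-marker association sketch. It assumes an additive model in which genotypes are coded as allele dosages (0, 1 or 2) and the trait is regressed on dosage by ordinary least squares. The function name and the data are hypothetical; real analyses would add significance testing, covariates and multiple-testing correction.

```python
def single_marker_association(dosages, trait_values):
    """Least-squares regression of a quantitative trait on allele dosage.

    Returns (slope, intercept, r_squared). The slope is the estimated
    additive effect per allele copy; r_squared is the fraction of trait
    variance explained by this one marker.
    """
    n = len(dosages)
    if n != len(trait_values) or n < 3:
        raise ValueError("need matching inputs with at least 3 individuals")
    mean_x = sum(dosages) / n
    mean_y = sum(trait_values) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in trait_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(dosages, trait_values))
    if sxx == 0 or syy == 0:
        raise ValueError("marker or trait shows no variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical data: six individuals, one SNP, one metabolite level.
dosages = [0, 0, 1, 1, 2, 2]
metabolite = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, intercept, r2 = single_marker_association(dosages, metabolite)
print(f"effect per allele = {slope:.2f}, variance explained = {r2:.3f}")
```

The same function applies unchanged when the trait is a transcript or protein abundance, which is the sense in which eQTL, pQTL and mQTL mapping share one framework.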

The study of the chemical composition (i.e. the metabolome) of plants has always been of great interest in biological research, in part because metabolic phenotypes (metabotypes) largely reflect the developmental stage of the plant and its interactions with the environment. In plants, the first studies combining metabolic phenotyping with QTL analysis were performed in tomato [139–141] and successfully uncovered loci (metabolite QTLs, mQTLs) regulating plant metabolite composition. In Arabidopsis [142–147] and in crops such as Brassica napus [148, 149], potato [150], rice [151] and maize [138, 152], mQTL mapping analyses have also been implemented using targeted and nontargeted metabolic profiling. Metabolite profiling-based approaches furthermore provide important steps towards the goal of hybrid performance prediction [152] and metabolomics-assisted crop breeding [153].

Similar MWAS were later performed in human studies [154–158]. Large panels of metabotypes have been analysed in association with genetic variants, disease-related phenotypes and lifestyle and environmental parameters, allowing dissection of the contribution of these factors to the aetiology of complex diseases [136]. These MWAS have reliably identified genetic factors that influence intermediate traits underlying phenotypes such as blood pressure [158], cardiometabolic disorder [157] and coronary heart disease [159]. In summary, the integration of GWAS and metabolomics further refines the G-P map and may eventually identify prognostic or diagnostic biomarkers of disease risk as well as biomarkers for predictive plant breeding.

3.5 Systems Biology: Genome-Scale Networks That Link Genes to Phenotypes

Associating sequence-level variation (such as SNPs and CNVs) with high-level variation in organismal phenotypes (such as disease susceptibility or crop yield) omits all of the intermediate steps in the chain of causation from genetic perturbation to phenotypic diversity. As mentioned above, intermediate molecular phenotypes (endophenotypes) such as epigenetic variation, transcript/protein abundance and metabolic traits vary genetically in populations and are themselves quantitative traits [98]. These endophenotypes functionally link genetic variation to disease-predisposing (for humans) or biomass-predisposing (for plants) factors and then to complex phenotypic end points. Excitingly, the so-called “genetical genomics” approach [160] now enables us to integrate genetic variation, various endophenotypic variation and variation in organismal phenotypes in a linkage or association mapping population in both humans [161] and plants [162], allowing quantitative genetic variation to be interpreted in terms of biologically meaningful causal networks of correlated transcripts.

However, it is becoming clear that none of the intermediate steps in translating biological information from genotype to phenotype stands alone [135]. The omics technologies now enable us to understand the biology inside the “black box” that lies between genotype and phenotype in terms of complex interacting networks [135, 163] (Fig. 11.4). Although we are still far away from a holistic understanding of the G-P map, systems biology is an emerging approach that aims to elucidate the higher-level behaviour of biological systems and focuses on the complex interactions within them, illuminating the path towards the ultimate goal, the complete G-P map. The integrative systems approach tries to link together the single-level omics data (e.g. genome, epigenome, transcriptome, proteome and metabolome) and, over time (if available [164]), to reveal and model the dynamic molecular regulatory networks or pathways from gene to function in order to bridge from genomics to phenomics. With the availability of increasingly powerful omics-based technologies, analytical and statistical tools and integrated knowledge bases, it has become possible to establish new links between genes, biological functions and a wide range of human diseases [165–179]. The comprehensive gene-disease associations provide the important insight that different disease modules (i.e. diseases that share common genetic origins) can overlap and that perturbations caused by one disease can affect other disease modules [180]. The identification of disease modules leads to the concept of the diseasome [165], which represents disease networks whose nodes are diseases and whose links represent the shared molecular relationships between disease pairs. The underlying disease-associated cellular components are mostly investigated in terms of protein-coding genes [165, 166, 168, 176, 177], though miRNAs [173, 178, 181], large intergenic noncoding RNAs (lincRNAs) [175] and metabolic pathways [171] have also been investigated. Importantly, uncovering such diseasome networks provides hints on how different phenotypes are linked at the molecular level.

Fig. 11.4

Schematic diagram depicting the strategy for integrated analysis of genetic and omic data. Large-scale genotyping and phenotyping are performed on segregating populations. Quantitative traits can be analysed on different levels to identify responsible loci (QTLs) based on QTL mapping approaches. Retrieved data can also be used in cluster analyses to identify gene-centred networks. The methodology of the combined use of genetic and omic technologies is commonly referred to as “genetical genomics” [160] and enables the elucidation of complex gene–phenotype networks (the G-P maps). This figure extends the work from Keurentjes [135]

Although GWAS and analogous methodologies have yielded large numbers of disease-gene candidates, it remains difficult to identify the particular gene and the causal mutation [180]. A series of sophisticated strategies have recently been developed to predict potential disease genes (Fig. 11.5). These network-based tools include linkage methods [182], functional module-based or “guilt-by-association” methods [166, 176, 177] and diffusion-based methods [183, 184]. Furthermore, it is believed that genes tend to work in evolutionarily conserved pathways or modules, so G-P maps can potentially be transferred between species. Based on this assumption, orthologous phenotypes (phenologs) can be used to systematically predict genes that are nonobviously associated with diseases across different organisms using overlapping sets of orthologous genes [185]. In summary, the value of these tools is expected to increase with the wealth of disease gene candidates beyond GWAS. Although most of the initial studies based on these tools were performed in humans, similar strategies can also be applied to plant biological research [186]. Indeed, networks for Arabidopsis [187], rice [188, 189] and maize [189] have been shown to connect thousands of genes accurately to phenotypes.
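The idea of a “significantly overlapping” set of orthologous genes can be made concrete with a small sketch. It assumes, purely for illustration, that significance is judged with a one-sided hypergeometric test on the number of shared orthologs; the function name and the counts are hypothetical, and the sketch is not a reproduction of the published phenolog pipeline.

```python
from math import comb


def overlap_p_value(n_orthologs: int, n_pheno_a: int,
                    n_pheno_b: int, n_shared: int) -> float:
    """Hypergeometric upper-tail probability for an ortholog overlap.

    n_orthologs: orthologous genes shared by the two organisms
    n_pheno_a:   orthologs linked to the phenotype in organism A
    n_pheno_b:   orthologs linked to the phenotype in organism B
    n_shared:    orthologs linked to both phenotypes
    Returns P(overlap >= n_shared) if the two sets were drawn at random.
    """
    total = comb(n_orthologs, n_pheno_b)
    largest_overlap = min(n_pheno_a, n_pheno_b)
    tail = sum(
        comb(n_pheno_a, i) * comb(n_orthologs - n_pheno_a, n_pheno_b - i)
        for i in range(n_shared, largest_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 2,000 orthologs, 30 and 40 phenotype genes, 8 shared.
p = overlap_p_value(2000, 30, 40, 8)
print(f"P(overlap >= 8) = {p:.3g}")
```

A small probability suggests that the two phenotypes share more genes than expected by chance, in which case the non-shared genes of one set become candidates for the phenotype in the other organism.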

Fig. 11.5

Methodologies for identifying trait-associated gene candidates. (a) Linkage methods. These methods combine linkage analysis (to determine the linkage interval of a specific trait) with protein–protein interaction (PPI) information. Genes (denoted as G1, G2 and so on) located in the linkage interval whose protein products interact with a known trait-associated protein are considered likely candidate genes. (b) Functional module-based or guilt-by-association methods. Functional modules are identified from clustering analysis of genome-scale networks. The members of such modules are considered candidate genes linked to specific phenotypes. (c) Diffusion-based methods. Starting from proteins that are known to be associated with a phenotype, a random walker visits each node in the interactome with a certain probability. The outcome of this algorithm is a trait-association score assigned to each protein, that is, the likelihood that a particular protein is associated with the phenotype. (d) Phenologs (orthologous phenotypes). Phenologs are used to map phenotypes between organisms based on significantly overlapping sets of orthologous genes. Perturbation of overlapping modules of orthologous genes may result in one set of phenotypes in one organism but a different set of phenotypes in another organism. The genes in such modules are considered candidates associated with the corresponding phenotypes (Parts a–c are modified from Barabasi et al. [180]. Part d is modified from McGary et al. [185])
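The diffusion-based approach in part (c) can be sketched as a random walk with restart. This is one common formulation, assumed here for illustration; the restart probability of 0.3, the function name and the toy interactome are hypothetical choices rather than values taken from the cited work.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3,
                             tol=1e-12, max_iter=10_000):
    """Score every node by its proximity to the seed proteins.

    adjacency: dict mapping each node to a list of its neighbours
               (assumed undirected: if B is listed under A, A is under B)
    seeds:     proteins already known to be associated with the trait
    Returns a dict of trait-association scores that sum to 1.
    """
    nodes = sorted(adjacency)
    seeds = set(seeds)
    if not seeds or not seeds <= set(nodes):
        raise ValueError("seeds must be a non-empty subset of the nodes")
    if any(len(adjacency[n]) == 0 for n in nodes):
        raise ValueError("every node needs at least one neighbour")

    # The walker starts, and restarts, uniformly on the seed proteins.
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed ...
        updated = {n: restart * start[n] for n in nodes}
        # ... otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            share = (1.0 - restart) * scores[n] / len(adjacency[n])
            for neighbour in adjacency[n]:
                updated[neighbour] += share
        change = sum(abs(updated[n] - scores[n]) for n in nodes)
        scores = updated
        if change < tol:
            break
    return scores


# Toy interactome: a chain P1 - P2 - P3 - P4 with P1 as the known protein.
interactome = {"P1": ["P2"], "P2": ["P1", "P3"],
               "P3": ["P2", "P4"], "P4": ["P3"]}
scores = random_walk_with_restart(interactome, seeds={"P1"})
for protein in sorted(scores, key=scores.get, reverse=True):
    print(protein, round(scores[protein], 3))
```

In this toy chain the score decreases with distance from the seed, so proteins closest to the known trait-associated protein are ranked as the strongest candidates.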

4 Perspectives and Future Challenges

The basic requirements for building an ideal phenomics realm are easy to imagine but still hard to realise. We are facing great opportunities but also great challenges in the areas of both genomics and phenomics. Although technically feasible, extensive and intensive measurement of genetic contents (such as epigenetic modification, gene expression and metabolite content) on large samples of genotypes across the full range of spatial and temporal scales is costly. Furthermore, the high density of genetic markers identified thus far has yet to be linked to the corresponding phenotypic traits. On the phenomics side, the major challenge resides in the multitude of phenotypic traits and environmental influences. The cost of a phenome project using current technology is extremely high [3]. High-throughput and high-resolution phenotyping technologies for the detection of both internal and external phenotypes, especially in plants, have started to open new horizons [3, 49]. Extracting as much quantitative information as possible from phenotyping data is a fundamental goal of phenomics. In other words, future phenomic efforts need to focus on comprehensive and quantitative measurements of phenotypes rather than on conventional low-dimensional and qualitative phenotype categorisations [3]. Developments in phenomics will increase both the number of phenotypic traits that are quantitatively assessed and the sample sizes (the number of individuals or genotypes characterised), resulting in major challenges for data analysis. Available state-of-the-art methods, such as partial least squares (PLS) regression, principal component analysis (PCA), random forests (RF) and support vector machines (SVM), can be used to analyse high-dimensional phenomic data. Another analytical challenge is the automated analysis of phenotyping data, since navigating huge imaging data sets manually is extremely tedious.

Regarding the linking of genotype to phenotype, many important challenges remain: (a) linking genes to traits, given that vast numbers of associated variants are located within noncoding regions of the genome [90]; (b) epistatic interactions [190]; (c) gene-environment interactions [191]; (d) epigenetic influences on phenotypic variation; and (e) variation in the outcome of mutations among individuals [73]. One promising solution here is to combine data from multiple “omics” technologies in what may be termed “a genome-wide systems-biology approach”.

In a nutshell, however, phenomics still lags far behind genomics. In contrast to the situation in humans, in plants it is relatively straightforward to carry out systematic genetic screens and large-scale phenotyping under various controlled environments. This provides an unbiased assessment of the genetic complexity of phenotypic traits [73]. The G-P maps are therefore ultimately expected to be more complete and more systematic in plants than they may be in humans. Notably, many phenomics tools, either already developed or under development, will give plant scientists the power to unlock the information coded in genomes (Table 11.3). In the near future, the plant phenotypic landscape will be populated at a faster pace to accelerate research in model organisms and to bridge the gap between genomics and phenomics [3, 49].