On the term “phenotype”

As biologists, do we really need to use the term “phenotype”? After all, an individual’s “trait(s)” or “character(s)” fit fairly well with both Wilhelm Johannsen’s (1909) historical definition of the phenotype (“All ‘types’ of organisms, distinguishable by direct inspection or only by finer methods of measuring or description [...]”) and the etymology of the word phenotype (from the Greek φαίνω, phaínô [“to shine, to show, to appear”] and τύπος, tupos [“mark, type”]). And if one argues that the term phenotype, used in opposition to “genotype”, is essential in genetics, we can point out that T. H. Morgan and his collaborators, in their seminal book The Mechanism of Mendelian Heredity (1915), never used either of these two terms themselves, although they gave a precise account of Johannsen’s work on bean, in which the distinction between genotype and phenotype was proposed. Throughout the book, Morgan et al. mention on the one hand “characters” and on the other hand “factors” (or Mendelian factors, or unit factors). Why did they not use the words “phenotype” and “genotype”, when for W. Johannsen these terms were fundamental, and have subsequently been adopted in almost every field of biology?

Relying on Mendel’s (1866) principles and exploiting the then recently discovered phenomenon of genetic linkage, T.H. Morgan et al. associated discrete variations in observable traits in Drosophila with factors (genes) located on chromosomes. Their results provided strong support for the chromosomal theory of heredity and laid the foundations for genetics. In their approach, it is the clear-cut observable difference in a trait, for instance pink or white eye color, that allows the inference of a change in a specific factor located at a particular chromosomal locus. There is a direct correspondence between trait variation and gene variation, unless there is complete dominance—but in this case additional progenies resolve this ambiguity.

Johannsen’s brilliant proposal for distinguishing phenotype from genotype came from his correct interpretation of simple selection experiments on bean seed size, a polygenic trait displaying continuous variation. Johannsen (1909) observed that in homozygous stocks there was no response to selection for large or small seeds, whereas in genetically heterogeneous populations, selection was effective. This meant that a fraction of the variation had non-genetic causes—such as the position of the bean within the pod or the position of the pod on the plant—making selection ineffective, while another fraction had genetic causes, allowing a response to selection. Comparing distributions of seed sizes in various bean lines, Johannsen (1911) noted: “The pure lines show transgressive fluctuation: it is mostly impossible to state by simple inspection of any individual bean the line to which it belongs”. In other words, a given genotype may display different phenotypes and a given phenotype may correspond to different genotypes. As a consequence, it became necessary to distinguish the non-genetic factors from the heritable factors of phenotypic variation. This led Johannsen to distinguish the “genotype” (“A ’genotype’ is the sum total of all the ’genes’ in a gamete or in a zygote”) from the phenotype. Thus, the word “phenotype” implicitly contains the genetic and non-genetic sources of variation of the trait under study as well as the interaction between the two. The non-genetic sources of variation are summed up by the term “environment”, a word that must be taken in a broad sense because it encompasses all possible biotic and abiotic influences including, for example, an individual’s age or maternal effects.
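
Johannsen’s reasoning can be illustrated with a small simulation. The sketch below is not a reconstruction of his data: the genotypic values (40, 50 and 60, in arbitrary units of seed size), the environmental standard deviation, the sample size and the function name `selection_response` are all hypothetical choices made for illustration. It assumes selfing homozygous lines, so that offspring inherit the genotypic value of their parent while the environmental deviation is drawn afresh.

```python
import random
import statistics


def selection_response(genotypic_values, env_sd=5.0, n=3000,
                       top_fraction=0.2, seed=1):
    """Return mean(offspring of selected parents) - mean(parental population).

    All numerical defaults are illustrative assumptions, not Johannsen's data.
    """
    rng = random.Random(seed)

    # Parental generation: phenotype = genotypic value + environmental deviation.
    parents = []
    for _ in range(n):
        g = rng.choice(genotypic_values)
        parents.append((g + rng.gauss(0.0, env_sd), g))

    # Truncation selection on the phenotype (largest seeds are kept).
    parents.sort(reverse=True)
    selected = parents[: int(n * top_fraction)]

    # Selfing of homozygous plants: the genotypic value is transmitted unchanged,
    # the environmental deviation is not.
    offspring = [g + rng.gauss(0.0, env_sd) for _, g in selected]

    parental_mean = statistics.mean([p for p, _ in parents])
    return statistics.mean(offspring) - parental_mean


# One pure line versus a mixture of three pure lines.
pure_line_response = selection_response([50.0])
mixed_response = selection_response([40.0, 50.0, 60.0])
print("pure line:", round(pure_line_response, 2))
print("mixture of lines:", round(mixed_response, 2))
```

Under these assumptions the response is close to zero in the pure line, where all variation is environmental, and clearly positive in the mixture, where part of the variation is genetic.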

The distinction between genotype and phenotype is sometimes compared to the difference between germplasm and soma. August Weismann (1892) proposed that in animals germ cells alone transmit hereditary material, while somatic cells, which differentiate because they contain various fragments of the germplasm, have no influence on germ cells, making the inheritance of acquired characteristics impossible. (It should be remembered that Lamarck [1744–1829] was not the only one to propose a mechanism for the inheritance of acquired characteristics: Darwin (1868) also developed the “hypothesis of pangenesis”, which was published nine years after his famous book On the Origin of Species.) In fact, the correspondence between genotype and germplasm on the one hand, and phenotype and soma on the other, is superficial if not misleading. Like the germplasm, the genotype is assumed to be shielded from external influences. However, the genotype–phenotype distinction does not depend on a model of differentiation (it does not presuppose the existence of germinal and somatic lines, and it also applies to plants and unicellular organisms), and it is primarily concerned with the question of “fluctuations” of a trait, which can arise from genetic and/or non-genetic causes.

The distinction between genotype and phenotype provided geneticists and evolutionary biologists with an essential conceptual framework. Along with various experimental results showing that Mendelian genetics may account for the inheritance of continuous variation (e.g. Nilsson-Ehle 1909; East 1916; Shull 1908), it helped reconcile the so-called Biometricians and Mendelians, who fiercely debated whether novel species originate through gradual change or through mutations of large effect (Olby 1989). The work of Fisher (1918) then definitively united genetics and evolution, and his statistical modeling of the genotype–phenotype relationship remains a cornerstone of quantitative genetics and evolutionary thinking.

What does genotype–phenotype “relationship” mean?

Regarding the seven characters he studied in garden pea, Mendel (1866) wrote that each character he selected for his experiments showed a difference in either form or coloration (e.g. round or wrinkled seeds, green or yellow pods). Similarly, for Morgan et al. (1915), “unit factors” (or Mendelian factors) are responsible for a difference observed in a trait, and not for the trait itself. They illustrated this with three observations: (i) a trait can be modified by mutations at different loci (they counted up to 25 mutations affecting eye color in Drosophila); (ii) a given mutation can alter different traits (the factor for rudimentary wings also affects the legs, the number of eggs laid and their viability), which is pleiotropy, although they did not use the term (see below); (iii) multiple alleles can be found at a given locus, resulting in various manifestations of a trait (for instance white/red/eosin eye color). In addition, they reported various cases where a genetic difference is not visible at the phenotypic level, due to environmental influences or other genetic factors. For instance, in primrose the red-white genetic difference in flower color is no longer visible when plants are grown at \(30\)–\(35\,^{\circ }\mathrm {C}\), because at high temperatures all flowers are white. In Drosophila the pink-vermilion genetic difference in eye color is not visible if the fly is homozygous white/white at one of the other loci influencing eye color. This is an example of epistasis, which was already well known at the time, since the term was coined by Bateson as early as 1907.

In principle, for any biologist and all the more so for any geneticist, it should be clear that the expression “genotype–phenotype (GP) relationship” (or GP map) actually means “the relationship between genotypic difference and phenotypic difference”. However, in the reductionist approach of molecular biology and genomics, this expression is also used, or understood, to mean the mechanistic causal chain of events between a gene and a trait. This improper conception is implicit in most graphical representations of the GP relationship, where simple arrows link genotypes to phenotypes, or genes to traits (Lewontin 1974; Wagner 1996; Houle et al. 2010), and it leads to commonly encountered shortcuts such as “the gene(s) for trait X”, both in the general press and in scientific journals. In the field of behavioral genetics, where phenotypes are difficult to define, this can result in weird claims such as “The MAOA gene predicts happiness in women” (Chen et al. 2013) or “The role of the CHRNA4 gene in Internet addiction: a case-control study” (Montag et al. 2012).

The flow of scientific knowledge sometimes looks like a treadmill, with well-established concepts that are rediscovered decades later or updated with novel techniques and new data. This is the case for the GP relationship, which was recently the subject of a detailed update (Orgogozo et al. 2015). What is intriguing is that the ideas proposed in this paper are almost all in Morgan et al.’s book (1915), published exactly one century earlier (albeit without the words genotype and phenotype—see above). The authors of the recent review insist on the differential nature of the genotype–phenotype relationship, developing points such as “genes as difference makers”, “the GP relationship is between two levels of variation”, “the differential part of a GP relationship”, etc. And like Morgan et al., they tackle “the problem of pleiotropy” and “the problem of epistasis and GxE” (Genotype x Environment interactions). Of course, the examples they give rely on data obtained using modern approaches and tools, but the conveyed message is strictly the same. The fact that the authors took the—timely—initiative to write such an article is indicative of the general ignorance within the biological research community of one of the historical foundations of genetics, namely that a gene does not make a trait, but that a genetic difference makes a phenotypic difference. An interesting extension of this notion, which Morgan et al. could hardly conceive, is that some GP relationships can transgress species: orthologous loci in extremely distant taxa can cause similar phenotypic differences (Martin and Orgogozo 2013).

The phenotype, a flexible concept

Initially, the term phenotype was applied mainly to visible macroscopic traits, such as size, shape, color, growth rate, grain number, seed coat patterns, etc. In fact, the definition of “phenotype” imposes no limit on its use, which has been extended in two ways.

First, phenotypes can be measured at every level of biological organization. For example, transcript, protein and metabolite abundances, as well as telomere length, epigenetic features, etc., are phenotypes amenable to genetic or evolutionary studies (Damerval et al. 1994; Brem et al. 2002; Johannes et al. 2008; Cook et al. 2016). Thus, transcriptomes, proteomes, metabolomes and epigenomes have become inexhaustible sources of molecular phenotypes. On an even finer scale, single macromolecules can be characterized by several phenotypes. For instance, Savir et al. (2010) measured four kinetic parameters (\(k_\mathrm {cat}\) and \(K_\mathrm {M}\) for \(\mathrm {CO}_2\) and \(\mathrm {O}_2\)) of the Rubisco protein, an essential enzyme involved in the first step of carbon fixation, in 27 photosynthetic species. Analysis of the correlations between parameters suggested that the evolution of this protein was constrained by a trade-off between speed and specificity. At the highest level of integration, individual fitness can be seen as the “ultimate” phenotype, which depends on all its genetic and non-genetic components. In this regard, the phenotype can even transgress the properties of the organism. According to Dawkins’ (1982) “extended phenotype” concept, the effects of an individual on its biotic and/or abiotic environment can affect its fitness, meaning that some fitness components are external to the body of an individual. For example, the manipulation of host’s behavior by some parasites can increase the parasite’s reproductive success (Andersen et al. 2009).

Second, the notion of phenotype can be extended by taking into account quantities that are inferred from mathematical functions. The easiest and most direct approach to characterize an individual’s phenotype is from single values, e.g. weight, size, compound concentration, organ number, etc., measured at a given time/age in a given environment. However, most traits of an organism vary in a complex way over time/age and/or environment, so that discrete measures are far from sufficient to capture the phenotypic features that are relevant in terms of fitness. As a consequence, many phenotypes are actually parameters calculated or estimated from mathematical functions, the so-called function-valued traits (Kingsolver et al. 2001). It is now commonplace to genetically dissect traits like growth rate, developmental parameters, photosynthesis rate, locomotion and nutrient uptake, but also the parameters of eco-physiological models (Reymond et al. 2003), biological rhythms (Takahashi et al. 2008) or reaction norms (Stratton 1998). “Hidden” traits can also prove to be quite relevant, such as gene expression noise (Ansel et al. 2008) or recombination rates (Petit et al. 2017). Note that at higher levels of organization phenotypes may be hard to define and quantify. For instance, traits pertaining to human behavior and psychiatric disorders pose specific problems that are not necessarily solved by resorting to “endo-phenotypes” or “intermediate” phenotypes (see Fisch (2017) for a discussion). Thus, there is no limit to the number of measurable or calculable phenotypes at all levels of biological organization and over the full range of spatial and temporal scales. For that reason, the goal of “phenomics”, which aims to characterize the phenome, i.e. “the full set of phenotypes of an individual” (Houle et al. 2010), is laudable but appears a pipe dream. 
In addition, there is a big-data challenge, as underlined for instance in the context of high-throughput phenotyping in plants. Some dozen robotic platforms have been established in various countries for phenotyping important crops and model species under controlled conditions, using recent sensor and imaging techniques. The traits considered, extracted from videos and sensor data, are diverse: biomass, yield, 2D and 3D architectural traits, growth model parameters, oil/protein/carbon/nitrogen content, color traits, chlorophyll fluorescence, transpiration, water use efficiency, etc. (Yang et al. 2020). The huge amount of phenomics information has to be translated into relevant biological knowledge. For this purpose, meta-analysis of heterogeneous data and artificial intelligence techniques are unavoidable, but cannot be a substitute for question-driven and model-assisted phenotyping (Tardieu et al. 2017).
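
To make the notion of a function-valued trait concrete, the sketch below treats a linear reaction norm as the phenotype: for each genotype, the trait is summarized by the intercept and slope of its response to an environmental variable, and these two parameters are what would then be analyzed genetically. The temperatures, trait values, genotype labels and the function name `fit_reaction_norm` are hypothetical, and a straight line fitted by ordinary least squares is only the simplest possible choice of function.

```python
def fit_reaction_norm(environments, phenotypes):
    """Ordinary least-squares fit of phenotype = intercept + slope * environment."""
    n = len(environments)
    mean_x = sum(environments) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in environments)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(environments, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical trait values of two genotypes measured at four temperatures.
temperatures = [15.0, 20.0, 25.0, 30.0]
measurements = {
    "genotype_A": [10.1, 11.9, 14.2, 15.8],  # strongly plastic
    "genotype_B": [12.0, 12.4, 12.5, 13.1],  # weakly plastic
}

# The "phenotype" of each genotype is now a pair of fitted parameters.
reaction_norms = {
    name: fit_reaction_norm(temperatures, values)
    for name, values in measurements.items()
}

for name, (intercept, slope) in reaction_norms.items():
    print(name, "intercept:", round(intercept, 2), "slope:", round(slope, 3))
```

The slope plays the role of a plasticity phenotype: it cannot be read from any single measurement, only from the fitted function.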

Inevitable pleiotropy

Phenotypes cannot be counted, but genomes are finite objects. Whole genome sequencing of many crops, livestock and model species has been achieved, and the number of genes is “only” a few tens of thousands for most multicellular organisms (https://www.ncbi.nlm.nih.gov/genome/). Thus, the prevalence of pleiotropy appears to be an inevitable consequence of the radical difference in nature between phenome and genome. On the one hand there is the impossible discretization of phenotypic features, resulting in a near infinity of traits that are inevitably related through the entanglement of multiple gene/metabolic/developmental networks (Boyle et al. 2017); on the other hand, there is a finite stock of informational units, the genes. Of course, this does not mean that each gene variation affects every phenotypic feature at any organizational level. Depending on where the gene product acts in the cell machinery, the extent of pleiotropy may differ considerably (reviewed in Stearns 2010). Nevertheless, it is expected that sufficiently fine-scale observations would reveal that any given gene displays some pleiotropy. In this context, the search for “orthologous” phenotypes, or “phenologs”, using a gene-based classification of phenotypes could be a promising line of research (McGary et al. 2010) (see also Edmunds et al. 2015).

Interestingly, the possible link between variations in different traits was recognized very early on, even in the context of formal genetics. One of the seven traits studied by Mendel (1866) in pea, the seed coat color, was strictly correlated with the pattern of flower color and the color of the stem in the axils of the leaves, so that he considered these three “differences” to be a single one. Morgan et al. (1915) wrote: “It is customary to speak of a particular character as the product of a single factor, as though the factor affected only a particular color, or structure, or part of the organism. But everyone familiar at first hand with Mendelian inheritance knows that the so-called unit character is only the most obvious or most significant product of the postulated factor. Most students of Mendelian heredity will freely grant that the effects of a factor may be far-reaching and manifold.” They gave various examples in Drosophila of what they called the “manifold effects of single factors”, such as the club mutant, in which the wing pads may fail to unfold and in which, in addition, (i) a pair of spines is absent, (ii) spines from another pair point in an abnormal direction, (iii) the head is often flattened, (iv) the eyes are smaller and (v) the thorax and abdomen are somewhat distorted. They concluded: “Here we have an example of a single germinal difference [...] producing several distinct effects [...]”. (Note that they did not use the word “pleiotropy”, which had been coined five years earlier by L. Plate, although in an article written in German [cited in Stearns 2010].)

The phenotypic level matters for the genotype–phenotype relationship

As previously mentioned, traits can be measured and/or calculated at any level of phenotypic organization. Because trait variation is polygenic in the vast majority of cases, even for molecular phenotypes, the concepts and approaches of quantitative genetics can be applied, regardless of the level considered: quantitative trait loci (QTL) are mapped and their effects measured, genetic effects such as dominance or epistasis are quantified, heritability is measured, etc. However, there are clear-cut differences between phenotypic levels regarding key genetic and evolutionary features:

(i) Most of the variation observed at molecular levels does not translate into variation at the level of integrated traits. This initially surprising observation has been at the root of the neutral theory of molecular evolution (Kimura 1983). We now know that variation at lower levels affects the higher levels following non-linear processes, which may result in “phenotypic buffering” (Jingyuan et al. 2009), “pervasive robustness in biological systems” (Félix and Barkoulas 2015) or canalization (Waddington 1957; Gibson and Dworkin 2004). The seminal example of such a mechanism is the enzyme-flux relationship in metabolic networks: the shape of the relationship (a concave curve reaching a plateau) implies that an increase in enzyme activity or concentration may have a negligible effect on the flux (Wright 1934). Various other mechanisms have been described, such as feedback loops, feedforward motifs, signal amplification in signaling pathways, which all entail saturation curves that account for the buffering of high-level phenotypic traits (Félix and Barkoulas 2015; Alon 2020). Another consequence of saturation curves is dominance. Concavity implies that the phenotypic value of the heterozygote is higher than the mean of the homozygotes, even though there is semi-dominance at the lower adjacent level (Wright 1934; Kacser and Burns 1981). The generalization of this model to polygenic traits may account for heterosis, i.e. the superiority of the hybrid over its parents (Fiévet et al. 2018).

(ii) Traits related to fitness usually have lower narrow-sense heritability (\(h^2\)) than less integrated traits. In 75 diverse animal species belonging to three distant groups (invertebrates, ectotherms and endotherms), 1,120 \(h^2\) estimates were compared in four trait categories: life-history (fecundity, viability, survival, development rate), behavioral, physiological and morphological (Mousseau and Roff 1987). Even though there were differences between the three groups of species, the results showed that, on average, life-history traits had the lowest heritability, morphological traits the highest, and behavioral and physiological traits were intermediate. A similar trend was observed in Arabidopsis. In 199 ecotypes, heritability was measured for 107 traits classified into flowering-related traits, defense-related traits, ionomics traits and developmental traits. On average, the flowering-related traits displayed lower heritability than the other traits (Yang 2017). Overall, these results are consistent with evolutionary theory, which predicts that natural selection decreases the additive genetic variance in traits that are tightly associated with fitness (Douglas 1981; Lynch and Walsh 1998).

(iii) Inbreeding depression is higher for life-history traits than for morphological traits. In a survey of 54 animal species, DeRose and Roff (1999) compiled inbreeding depression values for 35 life-history traits (survival, development time, fecundity) and 10 morphological traits such as adult body size, bristle number, etc. They showed that at \(F = 0.25\) (full-sib inbreeding coefficient), life-history traits experienced a mean reduction of \(\approx 11.8\%\) in trait value, whereas morphological traits showed a mean reduction of only \(\approx 2.2\%\). According to the authors, the most likely explanation is that positive dominance is on average lower for morphological traits than for life-history traits.
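
The link between concavity and dominance noted in (i) can be made explicit in the simplest case. The derivation below is a sketch under stated assumptions rather than the general model of the cited authors: the flux \(J\) is taken to be a hyperbolic function of the activity \(E\) of a single enzyme, with \(J_{\max }\) and \(K\) as illustrative constants, and the heterozygote is assumed to have exactly the mean activity of the two homozygotes (semi-dominance at the enzyme level). The symbols \(E_{11}\), \(E_{22}\), \(E_{12}\), \(a\) and \(d\) are introduced here for illustration.

\[
J(E) = J_{\max }\,\frac{E}{E+K}, \qquad E_{12} = \frac{E_{11}+E_{22}}{2}
\]

\[
a = \frac{J(E_{22})-J(E_{11})}{2} = \frac{J_{\max }K\,(E_{22}-E_{11})}{2\,(E_{11}+K)(E_{22}+K)}
\]

\[
d = J(E_{12}) - \frac{J(E_{11})+J(E_{22})}{2} = \frac{J_{\max }K\,(E_{22}-E_{11})^{2}}{2\,(E_{11}+K)(E_{22}+K)(E_{11}+E_{22}+2K)}
\]

\[
\frac{d}{a} = \frac{E_{22}-E_{11}}{E_{11}+E_{22}+2K}
\]

Since \(d > 0\) whenever the two homozygotes differ, the heterozygote flux always lies above the mean of the homozygote fluxes, even though the enzyme activities are strictly additive. For given activities, the ratio \(d/a\) increases as \(K\) decreases, that is, as the curve saturates more quickly.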

The explanations given for each of the three observations mentioned above do not exclude a unifying understanding based on the non-linear relationship between successive phenotypic levels. The following hypothesis is based on the example of genetic mitochondrial diseases, where phenotypic manifestations of the defect only occur when a certain proportion of mutated mtDNA is exceeded. This “phenotypic threshold effect” was qualitatively explained by the concave relationship at five successive levels of expression of a given mtDNA mutation: translation, enzyme activity, respiratory flux, cell activity and clinical manifestations (the integrated phenotype) (Rossignol et al. 2003). In fact, if there is a cascade of concave relationships, the curvature of the genotype–phenotype relationship is assumed to increase across phenotypic levels, resulting in a steeper ascending part of the curve and a larger plateau. In genetic terms, this increase in curvature as phenotypic levels become more integrated may account for phenotypic buffering and—because dominance is larger in high-level phenotypes—for both increased inbreeding depression and lower narrow sense heritability for fitness-related traits. Experimental assessment of this hypothesis could rely on multi-scale phenotyping of parents and offspring.
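
The qualitative argument about a cascade of concave relationships can be explored numerically. The sketch below is a toy model under explicit assumptions, not the model of Rossignol et al. (2003): every level is related to the previous one by the same rescaled hyperbola (hypothetical function `concave_step`, with an arbitrary constant `k`), a single biallelic locus is additive at the lowest level, and allele frequencies are 0.5 under random mating. For each level it reports the degree of dominance \(d/a\) and the additive fraction of the genetic variance, \(V_A/(V_A+V_D)\), which is only a proxy for narrow-sense heritability because environmental variance is ignored.

```python
def concave_step(x, k=1.0):
    """Rescaled hyperbola: concave, increasing, maps 0 -> 0 and 1 -> 1."""
    return (1.0 + k) * x / (x + k)


def dominance_across_levels(x_low, x_high, n_levels, k=1.0):
    """Propagate three genotypes through n_levels successive concave maps.

    Returns a list of (level, d/a, additive fraction of genetic variance).
    The heterozygote is exactly intermediate at the lowest (input) level.
    """
    values = [x_low, (x_low + x_high) / 2.0, x_high]
    results = []
    for level in range(1, n_levels + 1):
        values = [concave_step(v, k) for v in values]
        y_low, y_het, y_high = values
        a = (y_high - y_low) / 2.0            # additive effect
        d = y_het - (y_high + y_low) / 2.0    # dominance deviation
        ratio = d / a
        # With p = q = 0.5: V_A = a^2 / 2 and V_D = d^2 / 4.
        additive_fraction = 1.0 / (1.0 + 0.5 * ratio ** 2)
        results.append((level, ratio, additive_fraction))
    return results


# Hypothetical homozygote values at the lowest level: 0.2 and 1.0.
for level, ratio, additive_fraction in dominance_across_levels(0.2, 1.0, 5):
    print(f"level {level}: d/a = {ratio:.3f}, "
          f"V_A/(V_A+V_D) = {additive_fraction:.3f}")
```

In this toy setting the degree of dominance rises and the additive fraction of the genetic variance falls as one moves up the cascade, which is the direction expected under the hypothesis. The magnitude of the effect depends entirely on the arbitrary parameter choices.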

In conclusion, a hint of epistemology

In its simplest expression, the epistemological concept of emergence states that “the whole is more than the sum of its parts” (Mill 1843; O’Connor 1994) and that “more is different” (Philip 1972): a given organizational level may display properties that do not exist at lower levels. Biological systems, with their highly hierarchical organization, provide a myriad of examples. For instance, enzyme catalysis, rhythms, phyllotaxis spirals, the sense of smell, memory, consciousness, social organization, etc., are properties that emerge at a given level of organization (cellular, organismal or populational) and that do not make sense at other levels. The concept of phenotype is tightly linked to that of emergence. The phenotypic traits measured or calculated at a given organizational level characterize the properties that emerge at this level. Whether or not higher-level properties are reducible to lower-level properties is a long-standing debate in the philosophy of science (see e.g. Francescotti 2007). Without taking sides, it is clear that a purely reductionist approach, albeit essential in biology, is not sufficient when dealing with properties at integrated levels, and systemic methods have to be used. This is the ongoing and daunting challenge we have to face to understand the genotypic bases of variation, a central question in biology.