Introduction

Stem cells are characterized by their ability to self-renew and differentiate along multiple distinct lineages. Due to these remarkable properties, there is much hope for stem cell-based therapies in regenerative medicine. However, the development of such therapies will require a thorough understanding of the molecular mechanisms by which stem cells balance self-renewal and differentiation. Since stem cells are often rare (as in the adult) or exist only transiently (as in development), recent years have seen a growing focus on using single-cell profiling technologies to understand stem cell dynamics. These studies have indicated that apparently functionally homogeneous stem cell populations can vary widely in their expression of important regulators of self-renewal and multipotency. In some cases, this variability is driven by dynamic fluctuations of master transcription factors, suggesting that stem cell heterogeneity has an important functional role [1, 2, 3••]. However, the relationship between molecular heterogeneity and stem cell function is still not well understood.

Recent years have seen remarkable advances in single-cell sequencing techniques, and it is now possible to profile large portions of the genome, or the entire transcriptome, in hundreds to thousands of individual cells in a single experiment [4••, 5, 6]. Advances in single-cell epigenetics and proteomics are not far behind [7,8,9,10]. These advances promise to transform our understanding of cellular identities, yet they also produce vast amounts of complex data, making it a significant challenge to distinguish meaningful biology from experimental noise. In the context of stem cell dynamics, numerous reports have indicated that functionally homogeneous stem cell populations, both from the adult and the embryo, are highly heterogeneous with respect to their patterns of gene and protein expression [11,12,13,14,15]. However, the extent to which this variability plays a functional role, and the extent to which it reflects inherent, but non-functional, expression noise, are not clear. Therefore, in order to understand stem cell function at the individual cell level, it has become increasingly necessary to use high-throughput profiling techniques to explore co-expression dynamics at the single-cell level, to identify rare (yet potentially functionally important) cells, and to determine how co-expression patterns change over time.

The data provided by these experiments are fundamentally different from those obtained from measurements on cellular aggregates. Bulk methods typically provide estimates of the mean expression of each variable (e.g., gene) profiled over all cells in the aggregated sample (perhaps along with an estimate of variance when multiple replicates are taken), but they are not generally well suited to exploring dependencies between variables because they can only examine expression patterns on average, not within individual cells. By contrast, because single-cell methods profile co-expression patterns within individual cells, they provide a sample from the joint distribution of all the variables being profiled and are therefore much better suited to exploring functional relationships between variables. Importantly, recent years have seen significant improvements in the efficiency of single-cell RNA-sequencing methods, which now allow profiling of many tens of thousands of individual cells, thereby improving estimates of joint expression distributions [6, 16•, 17•].

The experimental progress made in capturing multivariate single-cell data has also stimulated research into new analysis techniques that are specifically designed to handle high-dimensional single-cell data [18, 19]. These new analysis methods often make use of classical multivariate statistics, and such statistical approaches have provided insight into many stem cell systems, including the identification and characterization of mixtures of cellular states [20], comparison of different stem cell lines [21], rare cell identification [22], and cell lineage decision-making [23]. However, methods from information theory are also increasingly being used to better understand how cellular expression patterns determine cellular identities.

Information Theory

Information theory has its roots in Shannon’s work on communication, and his famous 1948 paper laid out the mathematical theory of information [24, 25]. Shannon realized that in order to quantify the information content of a message, it is necessary to consider the message’s context, or how probable it is. An intuitive understanding of this can be seen in the following example. Consider a search for this article using only the last name of one of the authors. Which one is it best to choose? The knowledge that “Smith” is a very common last name and “MacArthur” is less common means that searching for “MacArthur” is more likely to narrow the search and therefore likely to provide more information. The fact that “MacArthur” is a more complex word than “Smith” is irrelevant: it is the rarity of each name that dictates which to choose, not the name itself. In the context of gene expression, the fact that a cell has seven transcripts of a particular messenger RNA (mRNA) does not in itself carry any information: this observation requires context in order to understand how much information is gained from the measurement. Without the context of how likely an observation of seven transcripts is, the information gained from the measurement is unknown (colloquially this is known as Shannon’s zeroth law). So how do we calculate information gain? Shannon argued that any measure of information should satisfy three basic requirements: monotonicity, independence, and branching. Monotonicity ensures that the information gained from the answer to a question with many possible answers is greater than that gained from the answer to a question with only a few possible answers. For example, to identify a specific person, an answer to the question “where do they come from?” provides more information than an answer to the question “are they female?” Independence ensures that the total information gained from two independent questions is the sum of the information gained from each question separately. So, for example, the order in which the questions are asked should not matter. Lastly, branching ensures that when a series of questions is composed in a tree-like structure, the overall information gained by passing along a path through the tree is a weighted sum of the information gained from each branch point [25].

Shannon proved that the following function, which he called the entropy by analogy to the closely related thermodynamic entropy, uniquely satisfies these conditions. The Shannon entropy H is the expected amount of information gained from answering a question for which the probability of answer x is given by p(x),

$$ H(X) = -\sum_x p(x) \log p(x). $$

The entropy is a property of the probability distribution p(x), in the same way that the mean and variance are properties of p(x). Informally, the entropy is simply a measure of how “flat” or close to uniform p(x) is, and the “flatter” a distribution is, the greater the entropy and information gained. The units of entropy depend on the base of the logarithm: when the logarithm is taken to the base 2, as is common in information theory, entropy is measured in bits (one bit is the amount of information provided when observing one of two equally likely outcomes, e.g., the flip of a fair coin). Alternatively, entropy is measured in nats when using the natural logarithm (as is typically the case in statistical mechanics), and in hartleys when using base 10 (one hartley is the amount of information provided when observing one of 10 equally likely outcomes, e.g., a uniformly randomly chosen decimal digit). The equation for the entropy given above assumes that the random variable X is discrete. In practice, many measures of interest, such as molecular concentrations, are continuous and the continuous analogue to the entropy above is known as the differential entropy [26]. In the discrete case, the entropy has some useful properties (for example, H(X) ≥ 0) that are not inherited by the differential entropy. To account for these differences, several closely related variations such as the Kullback-Leibler divergence (also known as the relative entropy) and its generalizations are commonly used to assess similarity between continuous expression distributions [27, 28]. For example, the widely used t-SNE dimensionality reduction algorithm [29] (which has been used in several recent stem cell studies to explore heterogeneity in stem cell identities and cluster cell states [16, 17•, 22, 23]) uses the Kullback-Leibler divergence to assess the similarity between the observed co-expression distribution and that obtained by projecting the data to a lower-dimensional space.
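
As a concrete illustration, the short Python sketch below (our own, using only NumPy; the function names are ours) computes the entropy of a discrete distribution in bits or nats, together with the Kullback-Leibler divergence discussed above.

```python
import numpy as np

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p (in bits by default)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p(x) = 0 contribute nothing to the sum
    return -np.sum(p * np.log(p)) / np.log(base)

def kl_divergence(p, q, base=2.0):
    """Kullback-Leibler divergence D(p || q); assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

print(entropy([0.5, 0.5]))                      # fair coin: 1.0 bit
print(entropy([0.25, 0.75]))                    # biased coin: ~0.811 bits
print(entropy([0.5, 0.5], base=np.e))           # same fair coin in nats: ~0.693
print(kl_divergence([0.25, 0.75], [0.5, 0.5]))  # ~0.189 bits
```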

Information Theory and Stem Cell Biology

The utility of the entropy in understanding cell identities may be illustrated by returning to our example of the measurement of seven mRNA transcripts in a cell. To give this reading context, we need to better understand the natural variability of mRNA expression in the cell population of interest and thereby determine how unusual this reading is. Consider the following two hypothetical scenarios for mRNA expression in a population of stem cells, as shown in Fig. 1a. (Scenario 1) All cells in the population have seven mRNA transcripts (i.e., 7 is the only answer to the question “how many transcripts are in the cell?” and occurs with probability 1). In this case, since all cells are the same with respect to their transcript counts, the observation of seven transcripts cannot be used to discriminate one cell from another, and therefore does not impart any information. Accordingly, the entropy is H =  − 1 log(1) = 0 bits. (Scenario 2) Two stem cell subtypes are present in the population (types A and B). Cells of type A occur with probability 0 < p < 1 and have seven transcripts, while cells of type B occur with probability (1 − p) and have zero transcripts. In this case, the observation of seven transcripts allows us to positively discriminate cells of type A from those of type B and so imparts useful information. Furthermore, the amount of information we gain is related to the relative rarity of types A and B. In particular, the entropy is given by H =  − p log(p) − (1 − p) log(1 − p). Thus, when p is small, the observation of seven transcripts in a cell is a rare event, but the observation of zero transcripts is a common event, and so the entropy is low. Conversely, when p is large, the observation of seven transcripts is a common event while the observation of zero transcripts is a rare event, and again the entropy is low. However, when cells of both types are common in the population (i.e., if p ~ 0.5), the entropy reaches its maximum. It is worth noting that the fact that cells of type B express no transcripts is not relevant to the calculation of the entropy; they could have expressed any number of transcripts not equal to seven. All that is important is that cells of type B can be distinguished from those of type A by their mRNA transcript count.
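
To make the two scenarios concrete, the following sketch (ours, not taken from the original study) evaluates the binary entropy H =  − p log(p) − (1 − p) log(1 − p), using base-2 logarithms, at the values discussed above and plotted in Fig. 1a.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a population in which type A cells occur with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty about the identity of a drawn cell
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(binary_entropy(1.0))   # Scenario 1: all cells have seven transcripts -> 0 bits
print(binary_entropy(0.5))   # Scenario 2, p = 0.5: maximum uncertainty -> 1 bit
print(binary_entropy(0.25))  # Scenario 2, p = 0.25: ~0.811 bits
print(binary_entropy(0.01))  # type A cells very rare -> ~0.081 bits
```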

Fig. 1

Entropy and mutual information. a Entropy of hypothetical binary cell types: Scenario 1: all cells have 7 mRNA transcripts and the entropy is zero (there is no uncertainty). Scenario 2: cells are either type A (7 transcripts), which occurs with probability p, or type B (zero transcripts), which occurs with probability 1 − p. When there is an equal probability of observing either cell type (p = 0.5), we are maximally uncertain about the identity of a randomly drawn cell and the entropy H = 1 bit, the same as for tossing a fair coin. When the probabilities are unequal, for example when p = 0.25, uncertainty is reduced and the entropy is less than 1 bit. The final panel gives the relationship between entropy and p, from which it can be seen that maximum entropy occurs when p = 0.5. b Entropy of distributions: distributions of transcript abundance are typically not binary, but rather exhibit a spread of possible outcomes. Examples of a unimodal and a bimodal distribution with the same mean and variance, but different entropies, are shown. In the unimodal case, summary measures such as the mean and variance make good sense. However, in the bimodal case the population mean is not characteristic of either of the two subpopulations (it is rare to find a cell with the mean level of expression) and the variance, as a measure of the spread about this mean, is also misleading. By contrast, the entropy, which measures the amount of uncertainty we have concerning the identity of a randomly drawn cell from the population, provides useful information about cell–cell variability. c Mutual information as a measure of association: association between two random variables can be assessed by Pearson’s correlation coefficient (PCC), which considers the strength of linear association; Spearman’s correlation coefficient (SCC), which is based on rankings; and mutual information (MI), which assesses how much information one variable provides about the other. All three measures assess linear associations well (left panel), and SCC is a good measure of non-linear, monotonic associations (middle panel), but neither PCC nor SCC is a good measure of association for non-linear, non-monotonic associations (right panel). However, the MI may be used to determine that the two variables are related.

In practice, we would not expect all cells to express a given mRNA at one of two fixed levels; rather, intrinsic noise in gene expression naturally gives rise to variations in expression levels over time within each individual cell, and across the cell population at any fixed time (see Fig. 1b). While it often cannot be calculated explicitly as above, the entropy can nevertheless be estimated from experimental data to better understand this natural variation (it should be noted that entropy estimation is subject to some technical issues, including the effects of data binning and estimation bias [30,31,32]). For example, it has been suggested that a high degree of cell–cell variability in gene expression patterns within a functionally pure population, as quantified by the entropy of the joint expression distribution, is characteristic of undifferentiated pluripotent cells [33, 34•, 35]. Similarly, by considering patterns of gene expression in light of known signaling networks, Teschendorff and colleagues have argued that both pluripotent cells and cancer cells are associated with a state of high network entropy, characterized by the promiscuous co-expression of important hub proteins [36,37,38]. Relatedly, it has been observed that the entropy of gene expression in developing tissues increases with time in a manner that is closely related to differentiation dynamics [39,40,41].
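
A minimal plug-in estimator, sketched below with simulated Poisson transcript counts (our own illustration, not a bias-corrected estimator of the kind discussed in [30,31,32]), shows how the entropy can be estimated from single-cell data and why the choice of binning matters.

```python
import numpy as np

def plugin_entropy(samples, bins=30):
    """Naive ('plug-in') entropy estimate in bits from a sample of expression values.

    The estimate depends on the binning and is biased for small samples; these are
    the technical issues noted in the text.
    """
    counts, _ = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
# Hypothetical transcript counts: a unimodal and a bimodal population with the same mean.
unimodal = rng.poisson(lam=10, size=1000)
bimodal = np.concatenate([rng.poisson(2, 500), rng.poisson(18, 500)])

print(plugin_entropy(unimodal))  # lower entropy
print(plugin_entropy(bimodal))   # higher entropy: two distinct subpopulations (cf. Fig. 1b)
```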

While the entropy is good at assessing how likely it is that a particular expression value will occur, it is not well suited to assessing relationships between co-expression patterns. For this purpose, a related measure, the mutual information (MI), is widely used. Consider two discrete random variables, X and Y, which may be related in some unknown way. The entropy of the joint probability distribution p(x, y) is:

$$ H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y). $$

Informally, this is a measure of the information content of the joint distribution, but it is not a direct measure of association between the two random variables. In order to assess whether one variable provides information about the other, the mutual information I(X; Y) may be used [26]. The mutual information compares the observed joint distribution with that which would be observed if the two random variables were independent. In particular,

$$ I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = H(X) + H(Y) - H(X, Y), $$

where H(X) and H(Y) are the marginal entropies. If X and Y are independent then p(x, y) = p(x) p(y), so \( \log \frac{p(x, y)}{p(x)\, p(y)} = \log 1 = 0 \) for all x and y, and therefore I(X; Y) = 0. In this case, knowledge of one variable does not provide any information about the other variable. More generally, since I(X; Y) = I(Y; X) ≥ 0, the magnitude of the MI is a measure of the extent to which the observed joint distribution deviates from independence: larger values of MI indicate a stronger dependency between X and Y. The advantage of MI as a measure of association is that it does not specify in advance the nature of the relationship between X and Y, so it can capture non-linear, non-monotonic dependencies between variables in a general way that traditional correlation measures cannot (see Fig. 1c for some examples).
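
This point is easily verified numerically. The sketch below (our own; the same binning and bias caveats as for entropy estimation apply) estimates MI from a two-dimensional histogram and compares it with Pearson’s and Spearman’s coefficients for a non-linear, non-monotonic relationship of the kind shown in Fig. 1c.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def mutual_information(x, y, bins=20):
    """Histogram-based estimate of I(X; Y) in bits, using I = H(X) + H(Y) - H(X, Y)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def h(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    return h(px) + h(py) - h(pxy.ravel())

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 5000)
y = x**2 + rng.normal(0, 0.05, 5000)  # non-linear, non-monotonic dependence

print(pearsonr(x, y)[0])         # ~0: no linear association detected
print(spearmanr(x, y)[0])        # ~0: no monotonic association detected
print(mutual_information(x, y))  # clearly > 0: the dependence is detected
```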

Since the mutual information assesses the extent to which two random variables are independent of one another, it can be used to identify putative functional relationships between experimentally observed variables (e.g., genes or proteins) [42, 43]. For this reason, there has been much interest in using information-theoretic methods to infer genetic regulatory networks (GRNs) from gene expression data, in order to better understand cellular dynamics. Inferring a GRN first involves constructing a matrix of “similarities” between genes based upon pairwise comparisons of their expression profiles. Typically, similarity is assessed using either correlation-based or information-theoretic measures. Comparisons of different measures and their implementations can be found in a number of recent reviews [44, 45, 46••]. In order to infer a network of regulatory links from the similarity matrix, some form of filtering is required to exclude spurious links or links with little statistical support. The simplest form of filtering is thresholding. However, since thresholding does not take into account the local context of the interactions, it is a crude approach that does not necessarily retain important interactions or exclude unimportant ones. To address this shortfall, information theory can be used to refine thresholded networks and thereby infer regulatory interactions with greater accuracy. These methods are typically based upon more advanced information-theoretic notions such as: the data processing inequality (DPI), which allows indirect regulatory links to be removed from a network (the widely used ARACNE algorithm [47•, 48] makes use of the DPI); conditional mutual information, which assesses the dependency between two random variables conditioned on a third and can be used to uncover genes that share upstream co-regulation [49]; and the maximum relevance/minimum redundancy (MRMR) strategy, which ranks potential interactions for a given gene based on the difference between the MI between the gene and a putative target (maximum relevance) and the average MI of the putative target with all previously ranked targets (minimum redundancy). This last approach aims to rank the most informative interactions highly and is implemented in, for example, the MRNET and MRNETB algorithms [50]. Other GRN inference methods make use of closely related information-theoretic measures such as the transfer entropy (as implemented in the MIDER algorithm, for example [51]) and multivariate mutual information measures [52]. In the context of stem cell biology, information-based network reconstruction methods have been used with some success to identify novel regulators of pluripotency and lineage specifiers [53, 54], as well as to track changes in network structure during cellular differentiation [52, 55].
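
To give a flavour of how the DPI is used in practice, the sketch below (a bare-bones illustration of the pruning idea, not the published ARACNE implementation) removes the weakest edge of each fully connected triangle in a pairwise MI matrix.

```python
import numpy as np

def dpi_prune(mi, eps=0.0):
    """Prune putatively indirect edges from a pairwise MI matrix using the DPI.

    If the i-j interaction is mediated by k, the DPI implies
    MI(i, j) <= min(MI(i, k), MI(k, j)), so the weakest edge of each
    triangle is removed (eps is a tolerance on this comparison).
    """
    mi = np.array(mi, dtype=float)
    n = mi.shape[0]
    keep = mi > 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k in (i, j):
                    continue
                if mi[i, j] < min(mi[i, k], mi[k, j]) * (1 - eps):
                    keep[i, j] = keep[j, i] = False
    return np.where(keep, mi, 0.0)

# Toy example: gene 0 regulates gene 1, and gene 1 regulates gene 2; the apparent
# 0-2 interaction is indirect and is removed by the DPI step.
mi = np.array([[0.0, 0.9, 0.4],
               [0.9, 0.0, 0.8],
               [0.4, 0.8, 0.0]])
print(dpi_prune(mi))
```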

Conclusions

Here, we have summarized some of the ways that information theory can be used in combination with multivariate statistics to investigate stem cell identities. Although information-theoretic measures are not always intuitive and their practical application needs careful consideration, information theory provides a suite of tools that can help make the most of experimentally hard-earned data. As well as providing improved measures of variability and association, information theory also has a natural relationship with statistical mechanics [56, 57], and thereby provides a principled approach to the investigation of cellular dynamics. Statistical mechanics addresses the question of how observable “macroscopic” properties of a system arise from unobserved “microscopic” dynamics. For example, the pressure of a gas in a confined container (a macrostate) depends upon the average kinetic energy of the molecules in the gas and can therefore be predicted without detailed knowledge of the instantaneous position and velocity of all the individual gas molecules involved (a microstate). In the 1950s, Jaynes showed that statistical mechanics could be derived directly from information-theoretic principles [56, 57]. For example, he observed that the Boltzmann distribution, which is ubiquitous in statistical mechanics, arises naturally as the maximum entropy probability distribution subject to appropriate physical constraints. It would be interesting to see if similar approaches can be used to better understand cell–cell variability in stem cell systems: do observed patterns of variability in stem cell populations reflect natural biological constraints? If so, what are they? To what extent does cell–cell variability relate to stem cell function? Can a general theory of regulated cellular variability be derived using physical and information-theoretic principles? Some progress has been made towards these aims [33, 58,59,60], and this is an exciting area of current research, yet there is still much to be done. Although the relationships between cell–cell variability, entropy, and cell function have yet to be fully deciphered, ongoing research indicates that information-theoretic measures can provide insights into cellular identities that are not apparent from more traditional multivariate statistical methods. We anticipate that improvements in the accuracy and reductions in the cost of single-cell methods will stimulate increased interest in the development and use of these approaches in the near future.
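
For the interested reader, the maximum entropy argument mentioned above can be sketched in a few lines (a standard derivation, included here for completeness): maximizing the entropy subject to normalization and a fixed mean energy ⟨E⟩, via Lagrange multipliers α and β, yields the Boltzmann distribution,

$$ \mathcal{L} = -\sum_i p_i \log p_i - \alpha \Big(\sum_i p_i - 1\Big) - \beta \Big(\sum_i p_i E_i - \langle E \rangle\Big), \qquad \frac{\partial \mathcal{L}}{\partial p_i} = 0 \;\Rightarrow\; p_i = \frac{e^{-\beta E_i}}{\sum_j e^{-\beta E_j}}, $$

where β is fixed by the energy constraint. Whether analogous constraints can be identified for gene expression distributions in stem cell populations is, as noted above, an open question.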