
1 Introduction

This essay attempts to persuade the Reader that statistical observations that may be dismissed as “marginal” or even “oddities” are far from such. On the contrary, they provide insights that lead to a better understanding of mechanisms which logically should exist but for which evidence is (and likely has to be) missing. To remain focused, we adhere to probabilistic models in evolution, genetics and cancer, disciplines in which the author claims expertise. The paper includes three case studies. The first concerns ascertainment bias in evolutionary genetics, arising when a comparison of two or more species is based on genetic markers discovered in one of these species. The second is quasistationarity, i.e., probabilistic equilibria arising conditionally on non-absorption; since evolution is a history of extinctions (which are absorptions), this is a valid field of study. The third involves inference concerning unobservable events in cancer, such as the appearance of the first malignant cell or the first micrometastasis; the topic is vital for public health, particularly in aging societies. We try to adhere to mathematical rigor wherever needed and to provide references. The Discussion concerns the wider context and philosophical implications.

2 Ascertainment Bias in Evolutionary Genetics

It has been observed that in evolutionary comparisons of Species 1 and 2, it is easy to err by using markers that were discovered in Species 1 and then sampled (“typed”) in both Species 1 and 2. Genetic markers have to exhibit among-individual variation to be useful, and therefore if a marker is discovered in Species 1, then on the average it is more variable in Species 1 than in Species 2. Variability of markers serves as a proxy for the rate of nucleotide substitution, which in turn may be a proxy for the rate of evolution. For this reason, if Species 1 and 2 descend from a common ancestral species, such as Human and Chimpanzee, and markers discovered in Species 1 (Human, for example) are employed, then we may deduce that Human has been evolving faster than its sister species Chimpanzee, when in fact it has not [2, 7, 23]. One remedy for this effect (which is a form of ascertainment bias) is to also use markers discovered in Species 2 and compare the outcomes in both cases. However, how should such data be analyzed and what inferences might be drawn? Li and Kimmel [19] demonstrate that this is quite complicated and that conclusions may be far from obvious.

2.1 Microsatellite DNA and Divergence of Human and Chimpanzee

Microsatellite loci are stretches of repeated DNA motifs of length 2–6 nucleotides. An example is a triplet repeat (motif of length 3) with allele length \(X=4\) (motif repeated 4 times)

$$ \cdots |ACG|ACG|ACG|ACG|\cdots . $$

Mutations in such loci usually have the form of expansions or contractions occurring at a high rate, \(\nu \sim 10^{-3}\)–\(10^{-4}\) per generation. More specifically,

$$\begin{aligned} X\longrightarrow X+U \end{aligned}$$
(1)

where U is an integer-valued random variable and mutation events occur at times constituting a Poisson process with intensity \(\nu \). Mutations under this Stepwise Mutation Model (SMM) mathematically form an unrestricted random walk (see e.g., [9]). Microsatellites are highly abundant in the genome. They are also highly polymorphic (variable). Applications of microsatellites include: forensics (identification), mapping (locating genes), and evolutionary studies.
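As a minimal illustration (not code from the cited papers), the SMM can be simulated as a symmetric random walk driven by Poisson mutation events; all numerical values below are assumptions chosen for illustration only.

```python
# Minimal sketch of the Stepwise Mutation Model (SMM): mutation events arrive
# as a Poisson process with intensity nu per generation, and each event changes
# the repeat count by +1 or -1 with equal probability.  Illustrative values only.
import numpy as np

rng = np.random.default_rng(0)

def smm_net_change(nu, generations, n_lineages):
    """Net change in repeat count after `generations` generations of the SMM."""
    k = rng.poisson(nu * generations, size=n_lineages)   # total number of mutation events
    expansions = rng.binomial(k, 0.5)                    # events that lengthen the repeat
    return 2 * expansions - k                            # expansions minus contractions

nu, t = 1e-4, 500_000
dx = smm_net_change(nu, t, n_lineages=10_000)
print("mean increment:", dx.mean())                      # close to 0 for a symmetric walk
print("variance of increment:", dx.var(), "vs nu*t =", nu * t)
```

Over t generations the net change has mean zero and variance approximately \(\nu t\), which is one reason variability at a locus can serve as a proxy for its mutation rate.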

Fig. 1

Evolutionary history of a locus in two species. Demographic scenario employed in the mathematical model and simuPOP simulations. Notation: \(N_{0}\), \(N_{1}\), and \(N_{2}\), effective sizes of the ancestral, cognate, and noncognate populations, respectively; \(X_{0}\), \(X_{1}\), and \(X_{2}\), increments of allele sizes due to mutations in the ancestral allele, in chromosome 1 and in chromosome 2, respectively. From Ref. [19]

A microsatellite locus can be considered to have a denumerable set of alleles indexed by integers. Two statistics can summarize the variability at a microsatellite locus in a sample of n chromosomes: the estimator of the genetic variance

$$\begin{aligned} \hat{V}/2={\displaystyle \sum _{i=1}^{n}}\left( X_{i}-\overline{X}\right) ^{2}/\left( n-1\right) , \end{aligned}$$
(2)

where \(X_{i}=X_{i}(t)\) is the length of the allele in the ith chromosome in the sample and \(\overline{X}\) is the mean of the \(X_{i}\); the expected value of \(\hat{V}\) is

$$\begin{aligned} V(t)=E(\hat{V})=E[(X_{i}-X_{j})^{2}], \end{aligned}$$
(3)

where \(X_{i}\) and \(X_{j}\) are exchangeable random variables representing the lengths of two alleles drawn from the population [17]; and the estimator of homozygosity

$$\begin{aligned} \hat{P}_{0}=(n{\displaystyle \sum _{k=1}^{K}}p_{k}^{2}-1)/(n-1), \end{aligned}$$
(4)

where \(p_{k}\) denotes the relative frequency of allele k in the sample; the expected value of \(\hat{P}_{0}\) is

$$\begin{aligned} P_{0}(t)=E(\hat{P}_{0})=\Pr [X_{i}(t)=X_{j}(t)]. \end{aligned}$$
(5)

The random variables \(X_{i}\) are exchangeable but not independent.
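For concreteness, here is a small self-contained sketch (not code from Refs. [17] or [19]) of how the estimators (2) and (4) might be computed from a sample of repeat counts; the sample values below are invented.

```python
# Sketch: the two summary statistics of Eqs. (2) and (4) computed from a
# hypothetical sample of allele lengths (repeat counts).
import numpy as np
from collections import Counter

def variance_estimator(x):
    """V_hat/2 of Eq. (2): the sample variance of allele lengths."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 2).sum() / (len(x) - 1)

def homozygosity_estimator(x):
    """P0_hat of Eq. (4): unbiased estimator of Pr[two sampled alleles are identical]."""
    n = len(x)
    p = np.array([c / n for c in Counter(x).values()])   # relative allele frequencies p_k
    return (n * (p ** 2).sum() - 1) / (n - 1)

sample = [20, 20, 21, 22, 20, 19, 23, 20, 21, 20]        # hypothetical repeat counts
print("V_hat/2 =", variance_estimator(sample))
print("P0_hat  =", homozygosity_estimator(sample))
```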

Li and Kimmel [19] considered the evolutionary history of a locus in two species. They employed the following demographic scenario in the mathematical model and in simuPOP [20] simulations (Fig. 1). At time t before present (time is counted in the reverse direction), a microsatellite locus is born in an ancestral species. At time \(t_{0}\), the ancestral species splits into species 1 (called cognate) and species 2 (called non-cognate). Notation: \(N_{0}\), \(N_{1}\), and \(N_{2}\) are the effective sizes of the ancestral, cognate, and non-cognate populations, respectively; \(X_{0}\), \(X_{1}\), and \(X_{2}\) are the increments of allele lengths due to mutations in the ancestral allele, in chromosome 1, sampled at time 0 (present) from the cognate population 1, and in chromosome 2, sampled from the non-cognate population 2, respectively.

2.2 Ascertainment Bias versus Drift and Mutation

In the random walk-like SMM model of mutation, a good measure of variability at a microsatellite locus is the length (repeat count) in a randomly sampled individual. Let us suppose that we discover a sequence of short motif repeats in the cognate species 1 and, if its number of repeats \(Y_{1}\) is greater than or equal to the threshold value x, we retain this microsatellite (we say we discovered it). Then we find a homologous microsatellite in species 2, i.e., a microsatellite located in the same genomic region (technically, flanked by sequences of sufficient similarity), provided such a microsatellite can be found. We then sample microsatellite alleles from species 1 and 2 and consider their lengths to be realizations of random variables \(Y_{1}'\) and \(Y_{2}\), respectively. Finally, we consider the difference

$$ D=E[Y_{1}'|Y_{1}\ge x]-E[Y_{2}|Y_{1}\ge x]. $$

Other things being equal, D is a manifestation of the ascertainment bias and is likely to be positive. However, things may not be entirely equal. For example, if species 1 has a lower mutation rate than species 2, then its microsatellites will tend to have lower maximum length, which may reduce D. On the other hand, if, say, species 2 consistently has had a smaller population size, then genetic drift might have removed some of the variants and now species 2 microsatellites will have lower maximum length, which may inflate D. Li and Kimmel [19] carried out analytical and simulation studies of D under a wide range of parameter values and obtained very good agreement between the two techniques (Fig. 2). Briefly, as explained already, the observed difference D in allele lengths may be positive or negative depending on the relative mutation rates and population sizes in species 0 (ancestral), 1, and 2. In conclusion, mutation rate and demography may amplify or reverse the sampling (ascertainment) bias. Other effects were studied by different researchers. For example, Vowles and Amos [23] underscore the effects of upper bounds on repeat counts. An exhaustive discussion is found in Ref. [19].
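The interplay of ascertainment, mutation rate, and shared ancestry can be illustrated with a deliberately simplified Monte Carlo sketch; it is not the analytical model of Li and Kimmel [19]. It assumes a single-step SMM in which b is taken (as an assumption about the meaning of the parameter of Fig. 2) to be the probability that a mutation expands the repeat, lets the two chromosomes sampled from the cognate species coalesce at an exponential time with mean \(2N_{1}\) (truncated at the split time \(t_{0}\)), and otherwise ignores drift. With equal mutation rates D comes out positive (a pure ascertainment effect); raising the noncognate rate \(\nu _{2}\) can shrink or reverse it.

```python
# Toy Monte Carlo sketch of the ascertainment bias D; all values illustrative.
import numpy as np

rng = np.random.default_rng(1)

def smm_increment(nu, duration, b):
    """Net repeat-count change over `duration` generations (vectorized over lineages)."""
    k = rng.poisson(nu * duration)     # number of mutation events
    ups = rng.binomial(k, b)           # events that expand the repeat (b = expansion probability)
    return 2 * ups - k

def ascertainment_bias(nu0, nu1, nu2, b, t, t0, N1, x_thr, x_init=20, n=200_000):
    ones = np.ones(n)
    x0 = x_init + smm_increment(nu0, (t - t0) * ones, b)   # ancestral allele at the species split
    tc = np.minimum(rng.exponential(2 * N1, n), t0)        # coalescence time of the two cognate lineages
    shared = smm_increment(nu1, t0 - tc, b)                # history shared by both cognate chromosomes
    y1  = x0 + shared + smm_increment(nu1, tc, b)          # discovery chromosome (species 1)
    y1p = x0 + shared + smm_increment(nu1, tc, b)          # typed chromosome (species 1)
    y2  = x0 + smm_increment(nu2, t0 * ones, b)            # chromosome from species 2
    keep = y1 >= x_thr                                     # locus "discovered" only if long enough
    return y1p[keep].mean() - y2[keep].mean()

# Equal mutation rates: D > 0, a pure ascertainment effect.
print(ascertainment_bias(1e-4, 1e-4, 1e-4, b=0.55, t=5e5, t0=2e5, N1=1e4, x_thr=30))
# A faster mutation rate in the noncognate species can shrink or reverse D.
print(ascertainment_bias(1e-4, 1e-4, 5e-4, b=0.55, t=5e5, t0=2e5, N1=1e4, x_thr=30))
```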

Fig. 2

Observed difference D in allele sizes may be positive or negative. Comparison of simuPOP simulations with computations based on Eq. (15) of Ref. [19]. a Values of D for the basic parameter values \(b_{0}=b_{1}=b_{2}=b=0.55\), \(\nu _{0}=\nu _{1}=\nu =0.0001\), \(t_{0}=2\times 10^{5}\) generations, and \(t=5\times 10^{5}\) generations, with the effective sizes of all populations concurrently varying from \(2\times 10^{4}\) to \(4\times 10^{5}\) individuals and with mutation rates \(\nu _{2}\) varying from \(\nu \) to \(5\nu \). b Values of D for the basic parameter values \(b_{0}=b_{1}=b_{2}=b=0.55\), \(\nu _{0}=\nu _{2}=\nu =0.0001\), \(t_{0}=2\times 10^{5}\) generations, and \(t=5\times 10^{5}\) generations, with the effective sizes of all populations concurrently varying from \(2\times 10^{4}\) to \(4\times 10^{5}\) individuals and with mutation rates \(\nu _{1}\) varying from \(\nu \) to \(5\nu \) (assuming 20 years per generation). From Ref. [19]

2.3 Hominid Slowdown and Microsatellite Statistics

Li and Kimmel [19] considered evidence for and against the so-called hominid slowdown (as discussed e.g., in Bromham et al. 1996), the observation that as the great apes become closer to the Human lineage, their nucleotide substitution rates (rates of point mutations in the genome) decrease. Consistent with this, Human and Human ancestors are expected to have slower substitution rates than Chimpanzee and its ancestors (following the divergence from the common ancestral species about 7 million years ago). Is this also true of microsatellite loci? Different molecular mechanisms shape these two types of mutations. Nucleotide substitutions result from random errors in DNA replication, which may escape repair but may also lead to dysfunctional proteins that are then eliminated from the population by natural selection (as discussed e.g., in [10]). Microsatellite mutations, by contrast, result from polymerase slippage during replication. Most microsatellites are located in noncoding regions and therefore are considered selectively neutral.

The study [19] involves a reconstruction of the past demography of Human and its ancestors as well as a hypothetical demography of Chimpanzee and its ancestors, including the migrations of Human out of its ancestral African territory and the resulting population growth, interrupted by recent glaciations and other events. Without getting into technical details, the conclusion is that the microsatellite mutation rate is likely to be higher in Human than in Chimpanzee. It is interesting to observe that regulatory sites in the genome also usually have the form of simple (albeit interrupted) repeats and vary quite considerably among species of mammals (as reviewed e.g., in Ref. [13]). It is possible to further hypothesize that evolution in higher mammals chose the path of regulation of gene expression as opposed to modification of the amino acid sequences in proteins; a possible reason being that the latter might be too slow.

3 Quasistationarity in Genome Evolution

Let us consider an effect which is important if extinctions are indeed common in evolution. Suppose that a proliferating population has a random component of such a nature that it leads any lineage to extinction with probability 1. On the other hand, proliferation is sufficiently fast to make up for extinction, so that the non-extinct part of the population may persist indefinitely. The long-term distribution of types of individuals in the population conditional on non-extinction, if such a distribution exists, is called the quasistationary distribution. Quasistationarity in a more general sense has been studied by mathematicians for a long time; the relevant literature has been collected by Pollet [21]. Here we will limit ourselves to an example from cell biology concerning gene amplification, based on an experiment pioneered by Schimke [22], with a mathematical model developed by Kimmel and Axelrod [15] and then generalized by Kimmel [14] and Bansaye [3]. Let us notice that extinction causes information about the evolution of the population to be scrambled. Therefore, if quasistationary distributions are interpreted as if they were ordinary stationary distributions, the conclusions may be paradoxical or misleading.

3.1 Gene Amplification in Cancer and Schimke’s Experiments

One of the prevalent types of rearrangements in the human cancer genome is gene amplification, i.e., an increase in the number of gene copies in cells beyond the usual diploid complement. Some examples have been provided by [1], but the phenomenon is quite common, usually appearing under the guise of copy number variation (CNV; Fig. 3).

Fig. 3

Cytogenetics of gene amplification. Amplified DNA can be present in various forms including double minutes. A two-chromosome genome is depicted (top of the figure). Examples of array CGH copy number profiles (bottom left; plotted as the normalized log2 ratio) are shown with corresponding FISH pictures (bottom right) of the cells using BAC clones from the region of the amplicon indicated by the red and green arrows. Many red and green signals can be seen in the double minutes in a methotrexate-resistant human cell line. From Ref. [1]

Classical experiments demonstrating gene amplification and its connection with drug resistance were carried out by Schimke [22]. The gist of the experimental data can be described as follows. After passaging surviving cultured cells into ever increasing levels of methotrexate (MTX) over a period of several months (of the order of 10 Msec), it was possible to evolve cells that were resistant to extremely high doses of MTX (Fig. 4). When the cells were returned to MTX-free medium, they were observed to lose resistance within about 100 cell doublings (some cultures did not, but we sweep these under the rug for now).

Fig. 4

Loss of resistance in Schimke’s experiments. Cells resistant to MTX are exposed to nonselective conditions. Some cell lines lose resistance completely (circles), while others only partially (squares and triangles). From Ref. [5]

Schimke discovered, using techniques available at that time, that the highly resistant cells had, besides the usual chromosomes, small acentric extrachromosomal DNA elements (he named them the “double minute chromosomes”, or DMs for short) that contained extra copies of the dihydrofolate reductase (DHFR) gene, which confers resistance to MTX [1]. It became clear that the increased resistance was due to amplification of the DHFR gene. But how did the amplified copies get there? Clearly a supercritical process of gene copy proliferation was at play. However, how did the cells know to multiply gene copies? The ghost of Lamarck knocked at the door.

3.2 Probabilists to Rescue

Fortunately for common sense, Kimmel and Axelrod [15] conceived an idea consistent with the neo-Darwinian paradigm (despite appearances, this sentence is not necessarily an oxymoron). The hypothesis can be stated as follows:

  • Increased resistance is correlated with increased numbers of gene copies on double minute chromosomes (DM).

  • The number of DHFR genes on double minutes in a cell may increase or decrease at each cell division. This is because double minutes do not have centromeres, which are required to faithfully segregate chromosomes into progeny cells.

  • The process of DM proliferation in cells is subcritical, since the DMs do not efficiently replicate. Therefore, cells grown in the absence of the drug gradually lose resistance to the drug, by losing extra gene copies.

The following model has been constructed by Kimmel and Axelrod [15].

  • Galton-Watson process of gene amplification and deamplification in a randomly chosen line of descent (Fig. 5).

    • Double minute chromosomes replicate irregularly.

    • Upon cell division, DMs are asymmetrically assigned to progeny cells.

  • The process is subcritical, i.e., the average number of DMs at division is less than twice that number assigned to the cell at birth. This is consistent with imperfect replication and segregation of DMs.

Fig. 5

A simplified view of the gene amplification and deamplification process. Each cell with at least one gene copy can give rise to 2 progeny cells, each of which, with probability b, has an amplified (doubled) count of DM gene copies, with probability d a deamplified (halved) count, or, with probability \(1-b-d\), the same count. Halving of a single DM results in 0 DMs. The histogram at the bottom shows the resulting distribution of gene copies per cell in the fourth generation. From Ref. [15]

The hypotheses of the model explain why, under nonselective conditions, the number of DMs per cell decreases, which causes a gradual loss of resistance (Fig. 5). In other words, zero is an absorbing state for the number of DMs. However, under selective conditions, only the cells with a nonzero DM count survive. Therefore, conditionally on nonabsorption (non-extinction of the DMs), according to the Yaglom theorem for subcritical branching processes, the number of DMs per cell converges in distribution to a quasistationary distribution.
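A short simulation sketch of the doubling/halving scheme of Fig. 5 along random lines of descent illustrates both effects: the fraction of lineages retaining at least one DM decays roughly geometrically, while the DM-count distribution among the surviving lineages stabilizes. The probabilities b and d below are illustrative, not the values fitted in Ref. [15].

```python
# Sketch of the doubling/halving model of Fig. 5 along random lines of descent.
# The process is made subcritical (d > b) so that DMs are eventually lost.
import numpy as np

rng = np.random.default_rng(2)
b, d = 0.05, 0.15                                 # per-division amplification / deamplification probabilities
n_lineages, n_generations = 500_000, 50

x = np.full(n_lineages, 4, dtype=np.int64)        # start every lineage with 4 DM copies
surviving = []
for gen in range(n_generations):
    u = rng.random(n_lineages)
    amplify = (u < b) & (x > 0)
    deamplify = (u >= b) & (u < b + d) & (x > 0)
    x = np.where(amplify, 2 * x, x)               # doubling of the DM count
    x = np.where(deamplify, x // 2, x)            # halving; halving a single DM yields 0 (absorption)
    surviving.append((x > 0).mean())

surv = np.array(surviving)
alive = x[x > 0]
print("survival fraction every 10 generations:", np.round(surv[::10], 4))
print("late-time ratio of successive survival fractions (~ geometric decay):",
      round((surv[41:] / surv[40:-1]).mean(), 3))
vals, counts = np.unique(alive, return_counts=True)
print("DM copy-number distribution among survivors (quasistationary):",
      dict(zip(vals.tolist(), (counts / counts.sum()).round(3).tolist())))
```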

Specifically, suppose that the proliferation of DMs from one cell generation to another, along a randomly selected ancestral line of cells, is described by a Galton-Watson branching process in which the number of “progeny” of a DM is a nonnegative integer random variable with generic probability generating function (pgf) f(s), under the usual conditional independence hypotheses. As already noticed, this process is subcritical, i.e., \(m=f'(1-)<1\). Let \(Z_{n}\) denote the number of DMs in generation n and let \(f_{n}(s)\) denote the pgf of \(Z_{n}\).

Yaglom Theorem (see e.g., Theorem 4 in Kimmel and Axelrod [16]) If \(m<1\), then \(\mathrm {P}[Z_{n}=j|Z_{n}>0]\) converges, as \(n\rightarrow \infty \), to a probability function whose pgf \(\mathcal {B}(s)\) satisfies the equation

$$ \mathcal {B}[f(s)]=m\mathcal {B}(s)+(1-m). $$

Also,

$$ 1-f_{n}(0)\sim \frac{m^{n}}{\mathcal {B}'(1-)},\; n\rightarrow \infty . $$

The Yaglom limit is an example of a quasistationary distribution, say \(\mu (x)\), which for a general Markov chain with absorbing state 0 can be defined via the following condition

$$ \mu (x)=\frac{\sum _{y\ge 1}\mu (y)\mathrm {P}_{y}[X(t)=x]}{\sum _{y\ge 1}\mu (y)\mathrm {P}_{y}[X(t)\ne 0]}, $$

where \(\mathrm {P}_{y}[X(t)=x]\) denotes the transition probability from state y to state x.
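For a finite truncation of the doubling/halving chain of Fig. 5, the condition above identifies \(\mu \) as the normalized left Perron eigenvector of the transition matrix restricted to the non-absorbed states, with the corresponding eigenvalue acting as the per-generation survival factor. A sketch, using the same illustrative b and d as above:

```python
# Sketch: quasistationary distribution of the doubling/halving chain of Fig. 5,
# computed as the left Perron eigenvector of the substochastic transition matrix
# over the non-absorbed states 1, 2, 4, ..., 2**K (truncation K is arbitrary).
import numpy as np

b, d, K = 0.05, 0.15, 30
Q = np.zeros((K + 1, K + 1))                 # state i corresponds to 2**i DM copies
for i in range(K + 1):
    Q[i, i] = 1.0 - b - d                    # same copy number
    if i + 1 <= K:
        Q[i, i + 1] = b                      # amplification: count doubles
    if i - 1 >= 0:
        Q[i, i - 1] = d                      # deamplification: count halves
# From state 0 (a single DM copy), deamplification leads to absorption; that
# probability mass (d) simply leaves Q, which is what makes Q substochastic.

eigval, eigvec = np.linalg.eig(Q.T)
k = np.argmax(eigval.real)
mu = np.abs(eigvec[:, k].real)
mu /= mu.sum()
print("per-generation survival factor (largest eigenvalue):", round(eigval[k].real, 4))
print("quasistationary probabilities of 1, 2, 4, 8 copies:", np.round(mu[:4], 4))
```

The largest eigenvalue should roughly match the late-time survival ratio observed in the simulation sketch above.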

Let us suppose now that the cell population has been transferred to MTX-free medium at generation \(n=N\). Based on the Yaglom Theorem, the fraction of resistant cells decreases roughly geometrically

$$ 1-f_{n}(0)\sim \frac{m^{n-N}}{\mathcal {B}'(1-)},\; n>N, $$

while the distribution of \(\{Z_{n}|Z_{n}>0\}\) remains (asymptotically) unchanged. Moreover, if \(2m>1\), then net growth of the resistant population is observed also during the selection phase (\(n\le N\)).

The loss of DMs under non-selective conditions has been visualized experimentally [5]. The population distribution of the number of gene copies per cell can be estimated by flow cytometry. The proportion of cells with amplified genes decreases with time (Fig. 6), while the shape of the distribution of gene copy number in the subpopulation of cells with amplified genes appears unchanged as resistance is gradually lost.

Fig. 6

Loss of resistance visualized by flow cytometry. Population distribution of numbers of copies per cell can be estimated by flow cytometry. Proportion of cells with amplified genes decreases with time. Shape of the distribution of gene copy number in the subpopulation of cells with amplified genes appears unchanged as resistance is gradually lost. From Ref. [5]

An important finding is that if \(2m>1\), i.e., if absorption is not too fast, then cell proliferation outweighs the loss of cells caused by the selective agent (MTX) and the resistant subpopulation grows in absolute numbers also under selective conditions (when \(n\le N\); details in the original paper and the book).

A more general mathematical model of replication of “small particles” within “large particles” and of their asymmetric division (“Branching within branching”) has been developed by Kimmel [14] and followed up by Bansaye [3]. It is interesting to notice that quasistationary distributions are likely to generate much heterogeneity. An example is provided by large fluctuations of the critical Galton-Watson process before extinction; see Wu and Kimmel [24].

3.3 Quasistationarity and Molecular Evolution

An observation can be made that trends observed in molecular evolution can be misleading if they are taken at face value, without an attempt to understand their underlying “mechanistic” structure. Looking at the evolution of resistance in cells exposed to MTX, one might conclude that there exists something in the MTX that literally leads to an increase in the number of DM copies, i.e., that gene amplification is “induced” by MTX. Only after it is logically deduced that DMs have to undergo replication and segregation, and after assuming that both these processes are less orderly in DMs than in the “normal” large chromosomes, does the conclusion concerning the true nature of the process (selection superimposed on subcritical branching) follow from the laws of population genetics.

4 Unobservables in Cancer

Early detection of cancer by mass screening of at-risk individuals remains one of the most contentious issues in public health. We will mainly use lung cancer (LC) as an example. The idea is to identify the “at risk” population (smokers in the LC case) and then to apply an “early detection” procedure (CT scan in the LC case) periodically among the members of the “at risk” population. By treating the early (and implicitly, curable) cases discovered this way, a much higher cure rate is assured than that of spontaneously appearing symptomatic cases. Is this reasoning correct? Two types of arguments have been used to question the philosophy just described. On the one hand, part of early detection may constitute overdiagnosis. Briefly, by an effect known from renewal theory, a detection device with less than perfect sensitivity, placed at a fixed point in time and confronted with examined cases “flowing in time”, preferentially detects cases of longer duration, i.e., those for which the asymptomatic early disease is more protracted. This effect is known as length-biased sampling (discussion in Ref. [12]). Its extreme form, called overdiagnosis, causes detection of cases that are so slow that they might show only at autopsy, or cases which look like cancer but do not progress at all. Overdiagnosis, if it were frequent, would invalidate early detection: a large number of “early non-cancers” would be found and unnecessarily treated, causing increased morbidity and perisurgical mortality, without much or any reduction in the LC death count.
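Length-biased sampling can be illustrated with a toy sketch (the sojourn-time distribution and all numbers are invented, not taken from Ref. [12]): a single screen at a fixed calendar time catches the cases whose preclinical phase straddles that date, and those cases have systematically longer sojourn times than the cases arising in the population.

```python
# Toy sketch of length-biased sampling by a single screen at a fixed date.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
onset = rng.uniform(0.0, 100.0, n)                      # preclinical onset time over a long horizon (years)
sojourn = rng.lognormal(mean=0.5, sigma=0.8, size=n)    # duration of the detectable preclinical phase (years)

screen_time = 50.0
caught = (onset <= screen_time) & (onset + sojourn > screen_time)   # phase straddles the screening date
print("mean sojourn time, all cases:       ", round(sojourn.mean(), 2))
print("mean sojourn time, screen-detected: ", round(sojourn[caught].mean(), 2))   # noticeably larger
```

For this lognormal choice the screen-detected mean is larger by roughly a factor of \(e^{\sigma ^{2}}\approx 1.9\), purely because of the sampling mechanism.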

On the other hand, the following scenario is possible, which may also invalidate screening for early detection, although for the opposite reason. If it happens that LC produces micrometastases that are already present when the primary tumor is of submillimeter size, then detection of even 2–3 mm tumors (achievable using CT) is futile, since the micrometastases progress and kill the patient whether the primary tumor has been detected or not.

How can one determine whether screening reduces the number of LC deaths? The orthodox biostatistics approach is “empirical”. It consists of designing a two-arm randomized controlled trial (RCT; screened versus non-screened high-risk individuals) and comparing the numbers of LC deaths in the two arms. This methodology is statistically sound, but it may be considered unethical. Patients in the control arm are denied potentially life-saving procedures. Those in the screened arm do not necessarily take advantage of the current state-of-the-art technology. Two sources of reduced contrast are noncompliance in the screened arm and/or “voluntary” screening in the control arm. It has been claimed that the results of the Mayo Lung Project (MLP) 1967–1981 trial, which influenced recommendations not to screen for LC by chest X-ray, were simply due to a lack of power to demonstrate the 5–10% mortality reduction that might be achievable using X-ray screening [12]. Finally, the National Lung Screening Trial (NLST) in the USA, in which around 50,000 smokers took part, demonstrated that a series of three annual screenings followed by treatment of detected cases reduced mortality by about 20 %. It has to be noted that predictions of a reduction of similar magnitude, obtained using modeling [18], were almost universally disregarded by the medical community.

The NLST has left as many questions unanswered as it did answer. One of them is the choice of the “best” high-risk group for LC screening. Given limited resources, how should they be allocated to subgroups of screenees so that the efficacy of a mass screening program is maximized? Even if the meaning of the term “efficacy” is clarified, it is still unknown who should be screened. Are these the heaviest smokers, the smokers who have smoked for the longest period of time, individuals with a family history of lung cancer, or those with impaired DNA-repair capacity [11]? At what age does it make sense to start screening and how often should the individuals be screened? Answers to these questions require knowledge of the natural course of the disease, which is exactly what is not observable (Fig. 7).

Fig. 7

Time lines of cancer progression and detection

Arguably, modeling can help. If a model of carcinogenesis, tumor growth and progression (i.e., nodal and distant metastases) is constructed and validated and models of early detection and post-detection follow-up are added to it, then various scenarios of screening can be tested in silico. Another use of modeling is less utilitarian, but equally important. It can be called the inverse problem: How much is it possible to say about the course of cancer based on snapshots including the disease characteristics at detection? To what extent is the size of the primary tumor predictive of the progression of the disease? In [6] some inferences of this type have been made (Fig. 8).

Fig. 8

Distributions of occult nodal and distant metastases in the simulated lung cancer patients (1988–1999) with stage N0M0, N1M0 and M1, stratified by tumor size. *TS: primary tumor size (diameter, in cm). **In the SEER data, 7208 cases were N0M1, which is 9.7 % of the 74,109 cases that had N and M staged; this stage is not modeled. From Ref. [6]

Figure 8 depicts the distributions of undetected (so-called occult) nodal and distant metastases in simulated lung cancer patients, fitting the demographic and smoking patterns of the SEER database 1988–1999, detected with stage N0M0, N1M0 and M1, stratified by primary tumor size. N0 and N1 correspond to the absence and presence of lymph node metastases, and M0 and M1 to the absence and presence of distant metastases, respectively. In other words, modeling makes it possible to estimate how many of the lung cancers detected as belonging to a given category in reality belong to different, prognostically less favorable, categories. The results show some unexpected trends. The most important are the three top rows of Fig. 8, which concern tumors detected without nodal or distant metastases (N0M0). On the face of it, these tumors offer the best prognosis. Model predictions confirm this intuition, up to a point. Indeed, up to a primary tumor size of about 1 cm, more than 50 % of apparent N0M0 tumors are truly N0M0. If they are detected at larger sizes, then only a minority are truly N0M0, and the rest have occult metastases. So, if these tumors below 1 cm are removed, there is a good chance the patient is cured. But, surprisingly, there is another turning point: at sizes above 2.5–3 cm, a majority of tumors are again truly N0M0. A similar, though less distinct, trend is present when we consider tumors detected as N1M0. Therefore, if a very large tumor is discovered without apparent nodal and distant metastases and it is resectable, then the suggestion is that it might be resected for cure.

The explanation for this peculiar pattern is that if the rates of growth and progression (metastasizing) of tumors are distributed, then detection “cuts out windows” in the distributions, through which the tumor population is observed. In the large-primary-tumor category with no metastases observed, we are dealing with the fast-growing, slowly metastasizing subset. Other subpopulations simply present with metastasis when the primary tumor is large, become symptomatic and quickly progress to death. So, active detection leads to biased TNM distributions, with the bias sometimes being non-intuitive.
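The “window” effect can be mimicked with a deliberately crude toy model, which is not the calibrated natural-history model of Ref. [6]: tumor volume grows exponentially with a patient-specific rate, and both metastatic seeding and detection occur with hazards proportional to the accumulated tumor volume, with invented parameter distributions. Conditioning on the detected size and on the absence of occult metastases then selects particular growth/progression combinations; in particular, large primaries without occult metastases come out as the fast-growing, slowly metastasizing subset.

```python
# Toy sketch of how detection "cuts windows" in distributed growth/progression rates.
# All distributions and constants are invented for illustration only.
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
g = rng.lognormal(mean=np.log(2.0), sigma=0.7, size=n)   # volume growth rate (per year)
m = rng.lognormal(mean=np.log(0.3), sigma=1.2, size=n)   # metastatic seeding hazard per unit volume-time
c = 0.2                                                  # detection hazard per unit volume-time

# Volume v(t) = exp(g*t); cumulative volume-time up to time t equals (v(t) - 1)/g.
e_det = rng.exponential(1.0, size=n)                     # exponential "hazard budget" for detection
v_det = 1.0 + g * e_det / c                              # tumor volume at detection
no_mets = rng.exponential(1.0, size=n) > m * (v_det - 1.0) / g   # no seeding before detection

diam = v_det ** (1.0 / 3.0)                              # relative tumor diameter
edges = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 4.0, np.inf])
for lo, hi in zip(edges[:-1], edges[1:]):
    sel = (diam >= lo) & (diam < hi)
    print(f"diameter {lo}-{hi}: fraction truly metastasis-free = {no_mets[sel].mean():.2f}")

big = diam > 3.0                                         # "large primary" category
for label, sel in [("large primary, no occult mets:", big & no_mets),
                   ("large primary, occult mets:   ", big & ~no_mets)]:
    print(label, "mean growth rate", round(g[sel].mean(), 2),
          "| mean metastasis coefficient", round(m[sel].mean(), 2))
```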

Mathematical models of the NLST trial predicted its outcome in two publications, one in 2004 [18] and the other in 2011 ([8]; submitted for publication before the NLST outcome was announced), using two different modeling approaches. As stated already, at that time these papers were universally ignored.

5 Discussion

What is the role and use of statistics as a profession (a science?) and of statisticians as professionals (scientists?)? In the minds of other scientists (physicists, biologists or physicians), statistics is perhaps mainly a useful, but strictly confirmatory, field. What is expected of a collaborating statistician is the “p-value” or the sample size needed to obtain a given “power” of a test, as required by the funding agencies. However, one may reflect on how many useful and deep scientific concepts and techniques are statistical in nature. Some of them have been revolutionary. We may list some with biological applications: Fluctuation Analysis (FA) in cell biology, the Moolgavkar-Knudson (M-K) model of carcinogenesis, the Wright-Fisher (W-F) model in population genetics, the Capture-Recapture (C-R) method in ecology, and the Maximum Likelihood (ML) and Least Squares (LS) methods in molecular phylogenetics, among others. However, let us notice that these methods are based on models that include structural features of the biological nature of the phenomenon in question. Some of these features are unobservable, such as mutations in cells in FA, stage transition times in M-K, segregation of chromosomes to progeny in W-F, uncaptured individuals in C-R, and ancestral nodes in phylogenetics.

Arguably, statistics is most useful when it considers phenomena in a “gray zone”, such as inference on the unseens, i.e., processes that we believe are real but which cannot be directly observed. Three phases of scientific inquiry are usually present:

  1. Initially, when there is little or no data; the unseens are not suspected to exist,

  2. Existence of the unseens is revealed through progress in data collection and logical analysis,

  3. Further progress may lead to resolution of the unseen by a reductionist approach.

The examples considered in this essay involve analyses in Phase 2. Each involves unseens that may become observable at some time. Also, each required the construction of a new model based on the inferred biology of the process. In addition, each of the examples includes a statistical sampling mechanism which introduces a bias (we may call it an ascertainment bias). The role of the model is, among other things, to understand and counter the bias. Arguably, this is the true purpose of statistical analysis.