Cardiac surgery—and particularly surgery for congenital heart disease—was quantitative from the outset [1], more so than most other medical specialties. This was largely stimulated by John Kirklin, who said “our true progress must continue to be based on research done painstakingly and accurately, and on experience honestly and wisely interpreted.” As time went on, he, his colleagues, and others in the field embraced and advocated for statistical methods that answered increasingly important and complex questions. They fostered development of new methods “born of necessity” when they encountered questions existing methods could not answer [2].

With time, however, the philosophical underpinnings, assumptions, limitations, and rationale for the use and development of these methods can be forgotten, leading to diminished understanding and even misunderstanding. Readily available do-it-yourself statistical packages, consisting of a limited repertoire of “standard” procedures, encourage use of methods, applied with little understanding, that may be inappropriate. Economics may also drive a wedge between collaborating statisticians and clinicians, even as statistical techniques develop explosively, some of them perfectly suited to answering the questions clinicians are asking.

Therefore, this introductory chapter traces the historical roots of the most common qualitative and quantitative statistical techniques that play an important role in assessing early and late results of therapy for pediatric and congenital heart disease. I will introduce them in roughly the order they came into use in this field, which recapitulates a progression from the simple to the more complex.

Uncertainty

Confidence Limits

In 1968 at the University of Alabama Hospital, outcomes of portacaval shunting for esophageal varices were presented at Saturday morning Surgical Grand Rounds: 10 hospital deaths among 30 patients. An outside surgeon receiving the live televised feed called in and began, “My experience has been exactly the same.…” Dr. Kirklin asked the caller how many portacaval shunts he had performed. “Three, with one death, the same mortality as you have experienced.”

Dr. Kirklin had no doubt that the caller was being truthful, but intuitively believed that one must know more from an experience with 30 patients than with 3. I indicated that there was a formal way to quantify his intuitive belief: confidence limits. Confidence limits are expressions of uncertainty. It is not that the data are uncertain—indeed, if one just wants to report facts and draw no inferences from them, such as risk for future patients, expressions of uncertainty are not needed. Confidence limits transform historic records of achievement into honest assessments of therapy for future patients, accounting for limited experience.

Underlying the concept of uncertainty, and confidence limits as its reflection, are at least two essential philosophical premises. First, unlike the nihilist, we embrace the philosophy that the future can be predicted, at least to some extent. Second, when we say we are predicting, we are referring to a prediction concerning the probability of an event for a future patient; we generally cannot predict exactly who will experience an event or when that event will occur.

Historically, the roots of confidence limits can be traced to Galileo, seventeenth century gamblers, and alchemists [3, 4]. If three dice are thrown at the same time, the gamblers wanted to know, what total score will occur most frequently, 10 or 11? From 1613 to 1623, increasingly meticulous experiments were done to guarantee fair throws. To everyone’s astonishment, these yielded equal occurrences of every possibility. From these 10 years of experimentation, Galileo developed what became known as the Laws of Chance, now known as the theory of probability [5]. The laws were derived from the ordering logic of combinations and permutations that had been developed by the alchemists.

We postulated that events and phenomena of cardiac surgery can also be considered to behave in accordance with this theory. These laws indicate that the larger the sample, the narrower the confidence limits around the probabilities estimated for the next patient. For treatment of patients with congenital heart disease, n—the number of patients—tends to be small. Confidence limits around point estimates of adverse events, therefore, are essential for interpreting the results and drawing inferences about risk for the next patient.

But what confidence limits should we use? We cannot use 100 % confidence limits because they always extend from 0 to 100 %. In the late 1960s, we decided on 70 % confidence limits. This was not an arbitrary decision, but was carefully considered. Seventy percent confidence limits (actually 68.3 %) are equivalent to ±1 standard error. This is consistent with summarizing the distribution of continuous variables with mean and standard deviation, and of model parameter estimates as point estimates and 1 standard error. Further, overlapping vs. nonoverlapping of confidence limits around two point estimates can be used as a simple and intuitive scanning method for determining whether the difference in point estimates is likely or unlikely to be due to chance alone [6]. When 70 % confidence limits just touch, the P value for the difference is likely between 0.05 and 0.1. When confidence limits overlap, the difference in point estimates is likely due to chance; when they are not overlapping, the difference is unlikely to be due to chance alone.
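A minimal sketch of how such limits behave, assuming a simple normal-approximation (±1 standard error) binomial interval rather than exact binomial limits, applied to the two experiences described above:

```python
import math

def ci_70(deaths, n):
    """Approximate 70% (plus/minus 1 standard error) confidence limits for a mortality proportion."""
    p = deaths / n
    se = math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - se), min(1.0, p + se)

# 10 deaths in 30 patients versus 1 death in 3 patients:
# the same point estimate, but very different uncertainty
for deaths, n in [(10, 30), (1, 3)]:
    p, lo, hi = ci_70(deaths, n)
    print(f"{deaths}/{n}: {p:.0%} (70% CL {lo:.0%} to {hi:.0%})")
```

By this approximation, 10 deaths in 30 yields limits of roughly 25–42 %, whereas 1 death in 3 yields roughly 6–61 %, quantifying Dr. Kirklin’s intuition that one must know more from an experience with 30 patients than with 3.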

P Values

In the context of hypothesis (or significance) testing, the P value is the probability of observing the data we have, or something even more extreme, if a so-called null hypothesis is true [7]. The phrase “statistically significant,” generally referring to P values that are small, such as less than 0.05, has done disservice to the understanding of truth, proof, and uncertainty. This is in part because of fundamental misunderstandings, in part because of failure to appreciate that all test statistics are specific in their use, in part because P values are frequently used for their effect on the reader rather than as one of many tools useful for promoting understanding and framing inferences from data [8–10], and in part because they are exquisitely dependent on n.

Historically, hypothesis testing is a formal expression of English common law. The null hypothesis represents “innocent until proven guilty beyond a reasonable doubt.” Clearly, two injustices can occur: an innocent person can be convicted (type I error) or a guilty person can go free (type II error). Evidence marshalled against the null hypothesis is called a test statistic, which is based on the data themselves (the exhibits) and n. The probability of guilt (reasonable doubt) is quantified by the P value or, expressed as odds against the null, (1/P) – 1.
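As a worked example of that odds expression, a P value of 0.05 corresponds to

$$\text{odds} = \frac{1}{P} - 1 = \frac{1}{0.05} - 1 = 19,$$

that is, odds of 19 to 1 against observing data this extreme if the null hypothesis were true.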

Some statisticians believe that hypothesis or significance testing and interpretation of the P value by this system of justice is too artificial and misses important information [11–13]. For example, it is sobering to demonstrate the distribution of P values by bootstrap sampling—yes, P values have their own confidence limits, too! These individuals would prefer that P values be interpreted simply as “degree of evidence,” “degree of surprise,” or “degree of belief” [14]. We agree with these ideas and suggest that rather than using P values for judging guilt or innocence (accepting or rejecting the null hypothesis), the P value itself should be reported as degree of evidence. In addition, machine learning ideas, which view data as consisting of signals buried in noise, have introduced multiple alternatives to P values that are less sensitive to n.
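As an illustration of this point, the following sketch bootstraps a simple two-group comparison with invented data and examines the spread of the resulting P values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical outcome measurements for two treatment groups (invented for illustration)
a = rng.normal(loc=10.0, scale=3.0, size=30)
b = rng.normal(loc=12.0, scale=3.0, size=30)

p_values = []
for _ in range(2000):
    # resample each group with replacement and recompute the test
    a_star = rng.choice(a, size=a.size, replace=True)
    b_star = rng.choice(b, size=b.size, replace=True)
    p_values.append(stats.ttest_ind(a_star, b_star).pvalue)

lo, hi = np.percentile(p_values, [15, 85])   # roughly 70% limits, in the spirit of the text
print(f"P value on original data: {stats.ttest_ind(a, b).pvalue:.4f}")
print(f"Bootstrap 70% limits for the P value: {lo:.4f} to {hi:.4f}")
```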

In using P values, some threshold is often set to declare a test statistic “significant.” Sir Ronald Fisher, who introduced the idea of P values, wrote, “Attempts to explain the cogency of tests of significance in scientific research by hypothetical frequencies of possible statements being right or wrong seem to miss their essential nature. One who ‘rejects’ a hypothesis as a matter of habit, when P ≤ 1 %, will be mistaken in not more than 1 % of such decisions. However, the calculation is absurdly academic. No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” [15].

Human Error

Although it may seem that human error is far removed from statistics, qualitative analysis of human error in fact played a prominent role in how we approached statistics in the early 1970s. As early as 1912, Richardson recognized the need to eliminate “preventable disaster from surgery” [16]. Human errors as a cause of surgical failure are not difficult to find [17–19], particularly if one is careful to include errors of diagnosis, delay in therapy, inappropriate operations, omissions of therapy, and breaches of protocol. When we initially delved into what was known about human error, in the era before Tenerife (1977), Three Mile Island (1979), Bhopal (1984), Challenger (1986), and Chernobyl (1986), events that contributed enormously to developing formal knowledge of the cognitive nature of human error, we learned two lessons from investigations of occupational and mining injuries [20, 21]. First, successful investigation of the role of the human element in injury depends on establishing an environment of non-culpable error [21]. The natural human reaction to investigation of error is to become defensive and to provide no information that might prove incriminating. An atmosphere of blame impedes investigating, understanding, and preventing error. How foreign this is to the culture of medicine! We take responsibility for whatever happens to our patients as a philosophic commitment [22, 23]. Yet cardiac operations are performed in a complex and imperfect environment in which every individual performs imperfectly at times [24]. It is too easy, when things go wrong, to look for someone to blame [25]. Blame by 20/20 hindsight allows many root causes to be overlooked [26]. Second, we learned that errors of omission exceed errors of commission. This is exactly what we found in ventricular septal defect repair, our first published study of human error [19], suggesting that the cardiac surgical environment is not so different from that of a gold mine and that we can learn from that literature.

Those who study human error suggest constructive steps for reducing it and, thus, surgical failure [27–30]. They affirm the necessity of intense apprentice-type training that leads to automatization of surgical skill and problem-solving rules [31], the value of simulators for acquiring such skills [32], and the importance of creating an environment that minimizes or masks potential distractions while supporting a system that discovers errors and allows recovery from them before injury occurs. In this regard, the pivotal study of human error during the arterial switch operation for transposition of the great arteries by de Leval and colleagues found that major errors were often recognized and corrected by the surgical team, but minor ones were not, and the number of minor errors was strongly associated with adverse outcomes [33, 34].

This led Dr. Kirklin to suggest that there were two causes of surgical failure: lack of scientific progress and human error. The former meant that gaps in knowledge remained that must be filled (research) in order to prevent these failures. The latter meant that we possessed the knowledge to prevent the failure, and yet it occurred. A practical consequence of categorizing surgical failures into these two causes is that they fit the programmatic paradigm of “research and development”: discovery on the one hand and application of knowledge to prevent failures on the other. The quest to reduce surgical failure by these two mechanisms is what drove us to use or develop methods to investigate these failures in a quantitative way.

Understanding Surgical Failure

Surgeons have intuitively understood that surgical failures, such as hospital mortality, may be related to a number of explanatory variables, such as renal and hepatic function [35]. However, we rarely know the causal sequence and final mechanism of these failures, particularly when they occur after a complex heart operation. There is simply no way of knowing absolutely everything that may have influenced outcome. Although it is not at all satisfying, an alternative to a mechanistic explanation is to identify variables that appear to increase the risk of a patient experiencing an adverse event. This actually is a seminal idea in the history of biostatistics, and it was born and developed in the field of heart disease by investigators in the Framingham epidemiologic study of coronary artery disease [36]. Two papers are landmarks in this regard. In 1967, Walker and Duncan published their paper on multivariable logistic regression analysis, stating that “the purpose of this paper is to develop a method for estimating from dichotomous (quantal) or polytomous data the probability of occurrence of an event as a function of a relatively large number of independent variables” [37]. Then in 1976, Kannel and colleagues coined the term “risk factors” (actually “factors of risk”), noting that (1) “a single risk factor is neither a logical nor an effective means of detecting persons at high risk” and (2) “the risk function … is an effective instrument … for assisting in the search for and care of persons at high risk for cardiovascular disease” [38]. In 1979 the phrase “incremental risk factors” was coined at UAB to emphasize that risk factors add in a stepwise, or incremental, fashion to the risk present in the most favorable situation, as we will describe subsequently [39].

A Mathematical Framework for Risk

Multivariable analysis as described by the Framingham investigators required a model as the framework for binary outcomes such as death. The model they chose was the logistic equation, which had been introduced by Verhulst between 1835 and 1845 to describe population growth in France and Belgium [40, 41]. It describes a simple, symmetric S-shaped curve much like an oxygen dissociation curve, in which the horizontal axis is risk (where the lowest possible risk is at –infinity, the highest possible risk is at +infinity, and 50 % risk is at 0) and the vertical axis is the probability of experiencing an event. The model reappeared in the work of Pearl and Reed at Johns Hopkins University in 1920 [42], and then prominently at Mayo Clinic in the 1940s, where Berkson described its use in bioassay (introducing terms such as the LD50 dose). The logistic equation was made a multivariable model of risk in the 1960s by Cornfield and colleagues [43] and by Walker and Duncan [37].
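In symbols, the multivariable logistic model described above takes the now-standard form

$$\Pr(\text{event}) = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k,$$

where z is the position on the (logit) scale of risk, running from negative to positive infinity with z = 0 corresponding to a 50 % probability, and each coefficient expresses the increment in risk attributable to its risk factor over and above the others in the model.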

Dr. Kirklin and I, however, approached risk factor analysis differently from most investigators. We wanted to know how best to use logistic regression to understand surgical failure. This led us to develop a framework to facilitate that process, built on the incremental risk factor concept, a guiding philosophy, and nomograms.

Incremental Risk Factor Concept

As I described in 1980 at the congenital heart disease meeting in London, an incremental risk factor is a variable identified by multivariable analysis that is associated with an increased risk of an adverse outcome (surgical failure) [6, 39]. In the context of other simultaneously identified factors, the magnitude (strength) and certainty (P value) of an incremental risk factor represent its contribution over and above those of all other factors. Thus, it is incremental in two ways: (1) with respect to being associated with increased risk and (2) with respect to other factors simultaneously incorporated into a risk factor equation.

In understanding surgical failures, we believed the incremental risk factor concept was useful in several ways.

  • Incremental risk factors are variables that reflect increased difficulty in achieving surgical success.

  • Incremental risk factors are common denominators of surgical failure.

  • Some incremental risk factors reflect disease acuity.

  • Some incremental risk factors reflect immutable conditions that increase risk. These include extremes of age, genetic disorders, gender, and ethnicity.

  • Some incremental risk factors reflect influential coexisting noncardiac diseases that shorten life expectancy in the absence of cardiac disease.

  • Incremental risk factors are usually surrogates for true, but unmeasured or unrecognized, sources of surgical failure.

  • An incremental risk factor may be a cause or mechanism of surgical failure. However, it is difficult to establish a causal mechanism outside the scope of a randomized, well-powered, well-conducted, and generalizable clinical trial, because of confounding between the selection factors that influence treatment recommendations and decisions, and outcome. Balancing score methods (such as the propensity score) attempt to remove such confounding and thereby approach causal inference more closely [44].

  • Some incremental risk factors reflect temporal experience. The “learning curve” idea is intended to capture variables relating to experience of the surgical team, but also those representing temporal changes in approach or practice.

  • Some incremental risk factors reflect quality of care and, as such, “blunt end” ramifications of institutional policies and practices, health care systems, and national and political decisions.

  • Incremental risk factors reflect individual patient prognosis. They cannot be used to identify which patient will suffer a surgical failure, but they can be used to predict the probability of failure.

  • However, some incremental risk factors may represent spurious associations with risk.

Philosophy

The inferential activity of understanding surgical failure, aimed at improving clinical results, is in contrast to pure description of experiences. Its motivation also contrasts with those aspects of “outcomes assessment” motivated by regulation or punishment, institutional promotion or protection, quality assessment by outlier identification, and negative aspects of cost justification or containment. These coexisting motivations stimulated us to identify, articulate, and contrast philosophies that informed our approach to analysis of clinical experiences. I have described these in detail in the Kirklin/Barratt-Boyes text Cardiac Surgery, but a few that were central to how we interpreted risk factors bear repeating [45].

Continuity Versus Discontinuity in Nature

Many risk factors related to outcome are measured either on an ordered clinical scale (ordinal variables), such as New York Heart Association (NYHA) functional class, or on a more or less unlimited scale (continuous variables), such as age. Three hundred years after Graunt, the Framingham Heart Disease Epidemiology Study investigators were faced with this frustrating problem [36, 46]. Many of the variables associated with development of heart disease were continuously distributed, such as age, blood pressure, and cholesterol level. To examine the relationship of such variables to development of heart disease, it was then accepted practice to categorize continuous variables coarsely and arbitrarily for cross-tabulation tables. Valuable information was lost this way. The investigators recognized that a 59-year-old’s risk of developing heart disease is more closely related to a 60-year-old’s than a coarse grouping of patients into the sixth versus the seventh decade of life would suggest. They therefore insisted on examining the entire spectrum of continuous variables rather than subclassifying the information. What they embraced is a key concept in the history of ideas, namely, continuity in nature. The idea has emerged in mathematics, science, philosophy, history, and theology [47]. In our view, the common practice of stratifying age and other more or less continuous variables into a few discrete categories is lamentable, because it loses the power of continuity (some statisticians call this “borrowing power”). Focus on small, presumed homogeneous, groups of patients also loses the power inherent in a wide spectrum of heterogeneous, but related, cases. After all, any trend observed over an ever-narrower framework looks more and more like no trend at all! Like the Framingham investigators, we therefore embraced continuity in nature unless it could be demonstrated that doing so was not valid, useful, or beneficial.

Linearity Versus Nonlinearity

Risk factor methodology introduced a complexity. The logistic equation is a symmetric S-shaped curve that expresses the relationship between a scale of risk and a corresponding scale of absolute probability of experiencing an event [39, 48]. The nonlinear relationship between risk factors and probability of outcome made medical sense to us. We could imagine that if all else positions a patient far to the left on the logit scale, a 1-logit-unit increase in risk would result in a trivial increase in the probability of experiencing an event. But as other factors move a patient closer to the center of the scale (0 logit units, corresponding to a 50 % probability of an event), a 1-logit-unit increase in risk makes a huge difference. This is consistent with the medical perception that some patients experiencing the same disease, trauma, or complication respond quite differently. Some are medically robust, because they are far to the left (low-risk region) on the logit curve before the event occurred. Others are medically fragile, because their age or comorbid conditions place them close to the center of the logit curve. This type of sensible, nonlinear medical relation made us want to deal with absolute risk rather than relative risk or risk ratios [49]. Relative risk is simply a translation of the scale of risk, without regard to location on that scale. Absolute risk integrates this with the totality of other risk factors.
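A brief sketch makes this concrete: the same 1-logit-unit increment in risk produces very different changes in absolute probability depending on where the patient lies on the scale (the positions chosen are purely illustrative):

```python
import math

def prob(logit):
    """Absolute probability of an event for a given position on the logit scale of risk."""
    return 1.0 / (1.0 + math.exp(-logit))

# A robust patient far to the left vs. a fragile patient near the center of the curve
for z in (-4.0, 0.0):
    print(f"logit {z:+.0f} -> {z + 1:+.0f}: probability "
          f"{prob(z):.3f} -> {prob(z + 1):.3f} "
          f"(absolute increase {prob(z + 1) - prob(z):.3f})")
```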

Nihilism Versus Predictability

One of the important advantages of using equations such as the logistic equation is that they can be used to predict results for either groups of patients or individual patients. We recognize that when speaking of individual patients, we are referring to a prediction concerning the probability of events for that patient; we generally cannot predict exactly who will experience an event or when an event will occur. Of course, the nihilist will say, “You can’t predict.” A doctor cannot treat patients if he or she is a nihilist and believes that there is no way to predict if therapy will have an effect. Thus, although we do not want to over-interpret predictions from logistic models, we nevertheless believe the predictions made are better than “seat of the pants” guesses.

Parsimony Versus Complexity

Although clinical data analysis methods and results may seem complex at times, as in the large number of risk factors that must be assessed, an important philosophy behind such analysis is parsimony (simplicity). There are good reasons to embrace parsimony to an extent. One is that clinical data contain inherent redundancy, and one purpose of multivariable analysis is to identify that redundancy and thus simplify the dimensionality of the problem. A corollary is that some variables likely introduce only noise, and what we want to find is real signal. A second reason is that assimilation of new knowledge is incomplete unless one can extract the essence of the information. Thus, clinical inferences are often even more digested and simpler than the multivariable analyses. We must admit that simplicity is a virtue based on philosophic, not scientific, grounds. The concept was introduced by William of Ockham in the early fourteenth century as a concept of beauty—beauty of ideas and theories [50]. Nevertheless, it is pervasive in science. There are dangers associated with parsimony and beauty, however. The human brain appears to assimilate information in the form of models, not actual data [51]. Thus, new ideas, innovations, breakthroughs, and novel interpretations of the same data often hinge on discarding past paradigms (thinking “outside the box”) [52]. There are other dangers in striving for simplicity. We may miss important relations because our threshold for detecting them is too high. We may reduce complex clinical questions to simple but inadequate questions that we know how to answer.

Nomograms

One of the most powerful tools to understand the relationship of incremental risk factors to surgical failure is graphics [53]. An important reason for our using and even developing completely parametric models such as the logistic model was that they so easily allow graphics to be generated in the form of nomograms, as advocated by the Framingham investigators [49]. For example, if an analysis indicates an association of survival with young age, we want to know what the shape of that relationship is. Is it relatively flat for a broad range of age and then rapidly increasing at neonatal age? Or does risk increase or decrease rather linearly with age? Although the answers to these questions are contained within the numbers on computer printouts, these numbers are not easily assimilated by the mind. However, they can be used to demonstrate graphically the shape of the age relation with all other factors held constant.
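A minimal sketch of generating such a nomogram, assuming a hypothetical fitted logistic model whose coefficients (b0, b_age, b_weight) are invented purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical coefficients from a fitted logistic model (invented for illustration only);
# in this example risk is highest at the youngest ages
b0, b_age, b_weight = -1.0, -0.12, -0.05

age = np.linspace(0, 18, 200)        # years
weight_held_constant = 20.0          # kg, with all other factors held constant

z = b0 + b_age * age + b_weight * weight_held_constant
probability = 1.0 / (1.0 + np.exp(-z))

plt.plot(age, 100 * probability)
plt.xlabel("Age (years)")
plt.ylabel("Predicted probability of death (%)")
plt.title("Shape of the age relation, all other factors held constant")
plt.show()
```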

Because graphics are so powerful in the process of generating new knowledge, an important responsibility is placed on the statistician to be sure that relations among variables are depicted correctly. Often, variables are examined and statistical inferences made simply to determine whether a continuous variable is a risk factor, without particular attention to what the data convey regarding the shape of its relationship to outcome. Instead, the statistician needs to focus during analysis on linearizing transformations of scale that may be needed to depict the relationship faithfully. Our experience indicates that most relations of continuous variables with outcome are smooth. They do not show sharp cut-offs, something we think investigators should be discouraged from looking for.

Effectiveness, Appropriateness, Timing, and Indications for Intervention

Our initial focus was on surgical success and surgical failure (safety), but we soon began to investigate the effectiveness, appropriateness, and timing of intervention. The concept evolved that only after we knew about the safety, effectiveness, long-term appropriateness, and optimal timing of possible alternative interventions, compared with the natural history of the disease, would we be able to define indications for intervention. This was subsequently embodied in the organization of each chapter of the Kirklin/Barratt-Boyes text Cardiac Surgery [45]. It was the reverse of the usual surgical thinking of the time, which began with indications rather than ending there.

What we immediately realized was that for most congenital heart lesions, knowledge of the natural history was scant. In seeking sources of that information, we were faced with time-related mortality data in multiple formats. Some data were typical right-censored data (meaning that we knew the time of death—uncensored observations—and the time of follow-up of persons still surviving—censored observations). Others were presented as temporal blocks of data and counts of deaths (eg, died within the first year, first year to age 5, 5–10, and so forth). Statisticians call this interval-censored count data. Yet others came from census data for which we knew nothing about deaths, only about patients in various time frames who were still alive (cross-sectional censored data). Dr. Kirklin was aware of, and had himself performed the calculations for, Berkson’s life table method [44, 54], and I had worked for Dr. Paul Meier of Kaplan-Meier fame using the product-limit method [55]. But this heterogeneous mix of data formats necessitated forging a new collaboration with an expert in such matters, Dr. Malcolm Turner. He indicated to us that our problem went deeper than just the data: we needed to figure out how we would manipulate those data once we had answers to the natural history question. It was his belief that we once again needed to formulate equations that could be mathematically manipulated to identify, for example, optimal timing of intervention. Thus began a decade-long quest for a parametric model of survival, culminating in May 1983.

By that time two important things had happened. The first was D. R. Cox’s proposal in the United Kingdom of a proportional hazards, semi-parametric approach to multivariable analysis of survival data [56]. Dr. Naftel visited him, showed him many of the survival curves we had generated, and asked for his advice. To our dismay, he responded that the proportional hazards assumption was highly unlikely to be appropriate for the intervention data we showed him: mortality was elevated immediately after surgery, and he opined that risk factors for this early mortality were likely very different from those pertaining to long-term mortality. He also thought the curves could be characterized as being of “low order” (that is, they could be described by a model with few parameters to estimate), and that a fully parametric model was what we should be looking for, one that could be easily manipulated not only to display results but also to determine optimal timing of operation. Finally, he thought our group had enough mathematical firepower to figure this all out.

The second event was failure of the Braunwald-Cutter valve [57]. Advice was sought from all quarters, including industry, on how to analyze the data and possibly make the tough and potentially dangerous decision to remove these prostheses. This brought us into contact with Wayne Nelson, a General Electric statistician who was consulting for the Alabama Power Company. He introduced us to the cumulative hazard method for calculating the life table, which brought several advantages [58]. First, it could be used to analyze repeatable events, such as repeat hospitalizations and adverse events. Second, each event could be coupled with what he called a cost value, such as the severity of a non-fatal event, e.g., a stroke [59, 60]. Third, it focused our attention on the need to consider the competing risk of death when thinking about potentially non-fatal events.

Thus, in developing a comprehensive model for time-related events, of necessity we knew we had to take into account simultaneously the multiple formats the data might come in, repeatable events, weighting applied to these events, competing risks, and mathematical manipulation of all these.

Time-Related Events

Time-related events are often analyzed by a group of methods commonly called “actuarial.” The word actuarial comes from the Latin actuarius, meaning secretary of accounts. The most notable actuarius was the Praetorian Prefect Domitius Ulpianus, who produced a table of annuity values in the early third century AD [4]. With emergence of both definitive population data and the science of probability, modern actuarial tables arose, produced first by Edmund Halley (of comet fame) in 1693 [61]. He was motivated, as was Ulpianus, by economics related to human survival, because governments sold annuities to finance public works [4]. Workers in this combined area of demography and economics came to be known as actuaries in the late eighteenth century. In the nineteenth century the actuary of the Alliance Assurance Company of London, Benjamin Gompertz, developed mathematical models of the dynamics of population growth to characterize survival [62]. In 1950, Berkson and Gage published their landmark medical paper on the life-table (actuarial) method for censored data, which they stated was no different from that used by others as early as at least 1922 [44, 54]. In 1952, Paul Meier at Johns Hopkins University and, in 1953, Edward Kaplan at Bell Telephone Laboratories submitted to the Journal of the American Statistical Association a new method for survival analysis, the product-limit method, that used more of the data. Estimates were generated at the time of each occurrence of an event. Further, the basis for the estimates was grounded in sound statistical theory. The journal editor, John Tukey, believed the two had discovered the same method, although presented differently, and insisted they join forces and produce a single publication. For the next 5 years, before its publication in 1958 [55], the two hammered out their differences in terminology and thinking, fearing all the while they would be scooped. The product-limit method (usually known as the Kaplan-Meier method), after considerable delay awaiting the advent of high-speed computers to ease the computation load, became the gold standard of nonparametric survival analysis. Until 1972, only crude methods were available to compare survival curves according to different patient characteristics [63–70]. The introduction by Cox of a proposal for multivariable survival analysis based on a semi-parametric proportional hazard method revolutionized the field [56].
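A minimal sketch of the product-limit calculation itself, using invented right-censored follow-up data:

```python
import numpy as np

# Hypothetical follow-up times (years); event = 1 means death observed, 0 means censored (still alive)
time  = np.array([0.2, 1.0, 1.5, 2.0, 2.0, 3.5, 4.0, 5.0])
event = np.array([1,   1,   0,   1,   0,   1,   0,   0])

# Product-limit (Kaplan-Meier) estimate: at each distinct death time,
# multiply the running survival estimate by (1 - deaths / number still at risk)
survival = 1.0
for t in np.unique(time[event == 1]):
    at_risk = np.sum(time >= t)
    deaths = np.sum((time == t) & (event == 1))
    survival *= 1.0 - deaths / at_risk
    print(f"t = {t:.1f} years  S(t) = {survival:.3f}")
```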

Unlike nonparametric and even semi-parametric survival estimation based on counting (martingale) theory, model-based or parametric survival estimation arose out of biomathematical consideration of the force of mortality, the hazard function [71]. The hazard function was a unidirectional rate value or function that transported, as it were, survivors to death with the same mathematical relations as a chemical reaction (compartmental theory). This idea arose during the Great Plague of the seventeenth century. John Graunt, a haberdasher, assumed a constant risk of mortality (the mortality rate or force of mortality), which generates an exponentially decreasing survival function (as does radioactive decay). He called this constant unidirectional rate the hazard function after a technical term for a form of dicing that had by then come into common usage to mean “calamity” [71]. Because a constant hazard rate presumes a mathematical model of survival, his was a parametric method. Today, this expression of hazard is called the linearized rate.
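In modern notation, a constant force of mortality and the survival function it generates are

$$\lambda(t) = \lambda, \qquad S(t) = e^{-\lambda t}, \qquad \hat{\lambda} = \frac{\text{number of events}}{\text{total patient-time of follow-up}},$$

with the linearized rate commonly reported, for example, as events per 100 patient-years.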

Although linearized rates have been used to characterize time-related events after cardiac surgery, particularly by regulatory agencies, it is uncommon for hazard to be constant [72]. However, a challenge in devising a time-varying parametric hazard model was that we often had only a small portion of the complete survival curve, such as 5- or 10-year survival after repair of a ventricular septal defect. The approach we finally figured out in the spring of 1983 was a temporal decomposition, much like putting light through a prism and depicting its colors [73]. Each component of the decomposition dominated a different time frame and could be modulated by its own set of risk factors, all estimated simultaneously.
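Schematically, and without reproducing the specific parametric forms of reference [73], such a decomposition expresses the hazard as a sum of overlapping phases, each scaled by its own log-linear function of risk factors:

$$\lambda(t \mid \mathbf{x}) = \mu_{E}(\mathbf{x})\,g_{E}(t) + \mu_{C}(\mathbf{x}) + \mu_{L}(\mathbf{x})\,g_{L}(t),$$

where the early-phase shape declines after operation, the middle term is a constant phase, the late-phase shape rises with time, and each scaling function of the form $\mu(\mathbf{x}) = e^{\beta_0 + \beta_1 x_1 + \dots}$ carries that phase’s own set of incremental risk factors, all estimated simultaneously.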

Repeatable Events

Unlike death, morbid events such as thromboembolism or infection after transplantation may recur. The most common method of analysis is to focus only on the first occurrence, ignoring any further information beyond that point for patients who experience the event. However, true repeated-events analysis can be performed using the Nelson estimator: patients are not removed from the risk set after they experience the event and thus remain at risk of experiencing it again, as sketched below. A special case of repeated events is the industrial method known as “modulated renewal” [74]. The idea behind a modulated renewal process is that the industrial machine (or patient) is restarted at a new time zero each time the event occurs. This permits (1) ordinary Kaplan-Meier methods to be used, (2) the number of occurrences and the intervals between recurrences to be used in multivariable analyses, and (3) changes in patient characteristics at each new time zero to be used in analyses. Thus, if the modulated renewal assumption can be shown to be valid, it increases the power and utility of the analysis tremendously.
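A minimal hand-rolled sketch of such a repeated-events estimator, with invented event and follow-up times; the key point is that a patient stays in the risk set after each event and leaves only at the end of follow-up:

```python
# Hypothetical data: times of (possibly repeated) morbid events per patient,
# and each patient's end of follow-up (censoring time), in years
events = {1: [0.5, 2.0], 2: [1.0], 3: [], 4: [0.8, 1.5, 3.0]}
followup = {1: 4.0, 2: 2.5, 3: 5.0, 4: 3.5}

def cumulative_events_per_patient(events, followup):
    """Nelson-style cumulative estimate for repeatable events: at each event time, add
    (events at that time) / (patients still under follow-up); patients are NOT removed
    from the risk set when they experience the event."""
    curve, total = [(0.0, 0.0)], 0.0
    for t in sorted({t for ts in events.values() for t in ts}):
        at_risk = sum(1 for end in followup.values() if end >= t)
        n_events = sum(ts.count(t) for ts in events.values())
        total += n_events / at_risk
        curve.append((t, total))
    return curve

for t, value in cumulative_events_per_patient(events, followup):
    print(f"t = {t:.1f}  cumulative events per patient = {value:.2f}")
```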

Competing Risks

Competing risks analysis is a method of time-related data analysis in which multiple, mutually exclusive events are considered simultaneously [75, 76]. It is the simplest form of continuous-time Markov process models of transition among states [77]. In this simplest case, patients make a transition from an initial state (called event-free survival) to at most one other state that is considered to be terminating. Rates of transition from the initial state to one of the events (called an end state) are individual, independent functions.

Analysis of a single time-related event is performed in isolation of any other event. This is ideal for understanding that specific phenomenon. In contrast, competing risks analysis considers multiple outcomes in the context of one another. It is thus an integrative analysis.

In the early eighteenth century, some progress was made in the war against smallpox by inoculating people with small doses of the virus to establish immunity to the full-blown disease. Because governments at that time were supported in part by annuities, it was of considerable economic importance to know the consequences a cure of smallpox might bring upon the government’s purse. Daniel Bernoulli tackled this question by classifying deaths into mutually exclusive categories, one of which was death from smallpox [78]. For simplicity, he assumed that modes of death were independent of one another. He then developed kinetic equations for the rate of migration from the state of being alive to any one of several categories of being dead, including from smallpox. He could then compute how stopping one mode of death, smallpox, would influence both the number of people still alive and the redistribution of deaths into the other categories. (The triumph of the “war on smallpox” came in 1796, just 36 years after his publication).
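In present-day notation, Bernoulli’s scheme corresponds to cause-specific hazards acting simultaneously on those still alive; schematically,

$$\frac{dS(t)}{dt} = -\big[\lambda_{\text{smallpox}}(t) + \lambda_{\text{other}}(t)\big]\,S(t), \qquad I_{k}(t) = \int_{0}^{t} \lambda_{k}(u)\,S(u)\,du,$$

where S(t) is event-free survival and I_k(t) is the cumulative incidence of end state k; removing one cause (setting its hazard to zero) redistributes events among the remaining causes, which is exactly the calculation Bernoulli performed.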

Weighted Events

As noted in previous text, once one thinks “out of the box” beyond probability theory, one can begin to imagine that any non-fatal event could be characterized not only as having occurred, but with a “cost” associated with it. This might be actual cost of a medical readmission, for example [79], or length of stay, or a functional health assessment metric.

Longitudinal Data Analysis

Today, we look beyond occurrences of clinical events. How do heart morphology and function change across time? How does a patient’s functional health status change across time? How often does supraventricular tachycardia occur? What variables modulate these longitudinal values? Importantly, do they influence clinical events? This is today’s frontier of statistical methods.

Severe technologic barriers to comprehensive analysis of longitudinal data existed before the late 1980s [80]. Repeated-measures analysis of variance for continuous variables had restrictive requirements, including fixed time intervals of assessment and no censored data. Ordinal logistic regression for assessment of functional status was useful for assessments made at cross-sectional follow-up [81, 82], but not for repeated assessment at irregular time intervals with censoring. In the late 1980s, Zeger and his students and colleagues at Johns Hopkins University incrementally, but rapidly, evolved the scope, generality, and availability of what they termed “longitudinal data analysis” [83]. Their methodology accounts for correlation among repeated measurements in individual patients and variables that relate to both the ensemble and the nature of the variability. Because average response and variability are analyzed simultaneously, the technology has been called “mixed modeling.” The technique has been extended to continuous, dichotomous, ordinal, and polytomous outcomes using both linear and nonlinear modeling.

Because of its importance in many fields of investigation, the methodology acquired different names. In 1982, Laird and Ware published a random effects model for longitudinal data from a frequentist school of thought [84]. In 1983, Morris presented his idea of empirical Bayes from a Bayesian school of thought [85]. In the late 1980s, members of Zeger’s department at Johns Hopkins University developed the generalized estimating equation (GEE) approach [83]. Goldstein’s addition to the Kendall series in 1995 emphasized the hierarchical structure of these models [86]. His is a particularly apt description. The general idea is that such analyses need to account for covariables that are measured or recorded at different hierarchical levels of aggregation. In the simplest cases, time is one level of aggregation, and the individual patient with multiple measurements is another. These levels have their corresponding parameters to be estimated, and each may require different assumptions about variability (random versus fixed-effects distributions). Except in special circumstances, these techniques have replaced the former restrictive varieties of repeated-measures analysis, which we now consider of historical interest except for controlled experiments designed to meet their assumptions exactly.
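A minimal sketch of a hierarchical (mixed) longitudinal model of this kind, using simulated data and the statsmodels package; the variable names, effect sizes, and random-intercept structure are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate repeated functional-status scores per patient at irregular times,
# with patient-level (random-intercept) variability around a common time trend
rng = np.random.default_rng(0)
rows = []
for pid in range(40):
    patient_level = rng.normal(0.0, 1.0)                     # between-patient variability
    for t in np.sort(rng.uniform(0, 5, size=rng.integers(3, 8))):
        score = 10 + patient_level - 0.4 * t + rng.normal(0.0, 0.5)
        rows.append({"patient": pid, "time": t, "score": score})
df = pd.DataFrame(rows)

# Linear mixed model: fixed effect of time, random intercept for each patient
fit = smf.mixedlm("score ~ time", df, groups=df["patient"]).fit()
print(fit.summary())
```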

Using the same strategy and mathematical formulation that Naftel, Blackstone, and Turner used for time-related events [73], we have introduced a longitudinal data analysis method in which the temporal occurrence of a binary response, such as the presence or absence of atrial fibrillation, is conceived as the sum of a number of temporal components, or phases. Each phase is modulated simultaneously by a log-linear additive function of risk factors. However, like all current methods, this one has only primitive built-in capability for selecting the variables that modulate the temporal components. Therefore, with a number of our colleagues and funding from the National Institutes of Health, we are actively developing new comprehensive methods for longitudinal data analysis.

Comparison of Treatments

Clinical Trials with Randomly Assigned Treatment

Controlled trials date back at least to biblical times, when casting of lots was used as a fair mechanism for decision-making under uncertainty (Numbers 33:54). An early clinical trial of a high protein vs. high calorie diet took place in the Court of Nebuchadnezzar, king of Babylon (modern Iraq). The first modern placebo-controlled, double-blinded, randomized clinical trial was carried out in England by Sir Austin Bradford Hill on the effectiveness of streptomycin versus bed rest alone for treatment of tuberculosis [87], although seventeenth and eighteenth century unblinded trials have been cited as historical predecessors [88–90]. Multi-institutional randomized clinical trials in pediatric and congenital heart disease have been championed by the Pediatric Heart Network over the last decade.

Randomization of treatment assignment has three valuable and unique characteristics:

  • It eliminates selection factors (bias) in treatment assignment (although this can be defeated at least partially by enrollment bias).

  • It distributes patient characteristics equally between groups, whether they are measured or not, known or unknown (balance), a well-accepted method of risk adjustment [91–94].

  • It meets assumptions of statistical tests used to compare end points [93].

Randomized clinical trials are also characterized by concurrent treatment, excellent and complete compilation of data gathered according to explicit definitions, and proper follow-up evaluation of patients. These operational by-products may have contributed nearly as much new knowledge as the random assignment of treatment.

Unfortunately, it has become ritualistic for some to dismiss out of hand all information, inferences, and comparisons relating to outcome events derived from experiences in which treatment was not randomly assigned [95]. If this attitude is valid, then much of the information now used to manage patients with congenital heart disease would need to be dismissed and ignored!

Clinical Studies with Nonrandomly Assigned Treatment

Clinical studies with nonrandomly assigned treatment produce little knowledge when improperly performed and interpreted. Because this is often the case, many physicians have a strong bias against studies of this type. However, when properly performed and interpreted, and particularly when they are multi-institutional or externally validated, clinical studies of real-world experience can produce secure knowledge. During the 1980s, federal support for complex clinical trials in adult heart disease was abundant. Perhaps as a result, few of us noticed the important advances being made in statistical methods for valid, nonrandomized comparisons, now called “comparative effectiveness studies.” One example was the seminal 1983 Biometrika paper “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” by Rosenbaum and Rubin [96]. In the 1990s, as the funding climate changed, interest in methods for making nonrandomized comparisons accelerated [97]. This interest has accelerated further in the twenty-first century.

Apples-to-apples nonrandomized comparisons of outcome can be achieved, within certain limitations, by use of so-called balancing scores, of which the propensity score is the simplest [96]. Balancing scores are a class of multivariable statistical methods that identify patients with similar chances of receiving one or the other treatment. Perhaps surprisingly, even astonishingly, patients with similar balancing scores are well balanced with respect to, at a minimum, all of the patient, disease, and comorbidity characteristics taken into account in forming the balancing score. This balancing of characteristics permits the most reliable nonrandomized comparisons of treatment outcomes available today [98].

The essential approach to a comparison of treatment outcomes in a nonrandomized setting is to design the comparison as if it were a randomized clinical trial and to interpret the resulting analyses as if they emanated from such a trial. This approach is emphasized in Rubin’s 2007 article, “The Design Versus the Analysis of Observational Studies for Causal Effects: Parallels with the Design of Randomized Trials” [99]. By design, Rubin notes, “I mean all contemplating, collecting, organizing, and analyzing data that takes place prior to seeing any outcome data.” He emphasizes by this statement his thesis that a nonrandomized set of observations should be conceptualized as “a broken randomized experiment…with a lost rule for patient allocation, and specifically for the propensity score, which the analysis will attempt to construct.” For example, the investigator should ask: Could each patient in all comparison groups be treated by all therapies considered? If not, this constitutes specific inclusion and exclusion criteria. If this were a randomized trial, when would randomization take place? Only variables that would have been known at that point, not after, should be used to construct the propensity score; this means that the variables chosen for the propensity score analysis must not be ones that could possibly be affected by the treatment.

The most common use of the propensity approach is to match pairs of patients on the basis of their propensity score alone. Outcomes can then be compared between the groups formed by these matched pairs, as sketched below. However, just as in a randomized trial, the results are applicable only to patients whose characteristics resemble those of the matched groups.
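A minimal sketch of propensity-score matching with simulated data; the covariates, the logit-scale caliper, and the greedy nearest-neighbor rule are illustrative assumptions rather than a prescription:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
# Hypothetical baseline covariates known before "randomization would have occurred"
X = rng.normal(size=(n, 3))                      # e.g., standardized age, weight, acuity
treated = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1]))))

# 1. Propensity score: modeled probability of receiving the treatment given the covariates
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
logit_ps = np.log(ps / (1 - ps))

# 2. Greedy 1:1 nearest-neighbor matching on the propensity score, within a caliper
caliper = 0.2 * logit_ps.std()                   # a commonly used caliper on the logit scale
controls = list(np.where(treated == 0)[0])
pairs = []
for i in np.where(treated == 1)[0]:
    if not controls:
        break
    distances = np.abs(logit_ps[i] - logit_ps[controls])
    j = int(np.argmin(distances))
    if distances[j] <= caliper:
        pairs.append((i, controls.pop(j)))       # each matched control is used only once

print(f"{len(pairs)} matched pairs formed from {treated.sum()} treated patients")
# Outcomes would then be compared within these matched pairs, as in a randomized trial.
```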

Where Have We Been and Where Are We Headed?

Analysis, as expressed by Sir Isaac Newton, is that part of an inductive scientific process whereby a small part of nature (a phenomenon) is examined in the light of observations (data) so that inferences can be drawn that help explain some aspect of the workings of nature [100].

Philosophies underpinning methods of data analysis have evolved rapidly since the latter part of the nineteenth century and may be at an important crossroad. Stimulated in large part by the findings of his cousin Charles Darwin, Sir Francis Galton, along with Karl Pearson and Francis Edgeworth, established at that time what has come to be known as biostatistics [101]. Because of the Darwinian link, much of their thinking was directed toward an empirical study of genetics versus environmental influence on biological development. It stimulated development of the field of eugenics (human breeding) [102] and the study of mental and even criminal characteristics of humans as they relate to physical characteristics (profiling). The outbreak of World War I led to development of statistics related to quality control. Sir Ronald Fisher formalized a methodologic approach to experimentation, including randomized designs [103]. The varying milieus of development led to several competing schools of thought within statistics, such as frequentist and Bayesian, with different languages and different methods [104]. Formalization of the discipline occurred, and whatever the flavor of statistics, it came to dominate the analytic phase of inferential data analysis, perhaps because of its empirical approach and lack of underlying mechanistic assumptions.

Simultaneously, the discipline of biomathematics arose, stimulated in particular by the need to understand the growth of organisms (allometric growth) and populations in a quantitative fashion. Biomathematicians specifically attempt to develop mathematical models of natural phenomena such as clearance of pharmaceuticals, enzyme kinetics, and blood flow dynamics. These continue to be important today in understanding such altered physiology as cavopulmonary shunt flow [105]. Many of the biomathematical models came to compete with statistical models for distribution of values for variables, such as the distribution of times to an event.

Advent of the fast Fourier transform in the mid-1960s [106] led to important medical advances in filtering signal from noise and image processing. The impetus for this development came largely from the communications industry, so only a few noticed that concepts in communication theory coincided with those in statistics and mathematics.

As business use of computers expanded, and more recently as genomic data became voluminous, computer scientists developed methods for examining large stores of data [107]. These included data mining in business and computational biology and bioinformatics in the life sciences. Problems of classification (such as of addresses for automating postal services) led to such tools as neural networks [16], which have been superseded in recent years by an entire discipline of machine learning [107, 108].

In the past quarter century, all these disciplines of mathematics, computer science, information modeling, and digital signal processing have been vying for a place in the analytic phase of clinical research that in the past has largely been dominated by biostatistics. Specifically, advanced statistics and algorithmic data analysis have conquered the huge inductive inference problem of disparity between number of parameters to be estimated and number of subjects (e.g., in genetics, hundreds of thousands of variables for n = 1) [109]. Advanced high-order computer reasoning and logic have taken the Aristotelian deterministic approach to a level that allows intelligent agents to connect genotype with phenotype [110]. It may be rational to believe that the power of these two divergent approaches to science can be combined in such a way that very “black-box” but highly predictive methods can be explored by intelligent agents that report the logical reasons for a black-box prediction [52].

Fortunately for those of us in cardiac surgery, we need not be threatened by these alternative voices, but rather can seize the opportunity to discover how each can help us understand the phenomena in which we are interested.