Introduction

Many epidemiologists may think that statistical regression is the only modeling technique in their toolkit, but statistical models are only one of several types of analytic models that are valuable to the discipline. Spatial models make use of geographic information systems; ecological models explain population dynamics; physiological models describe cellular functions; and forecasting, simulation, and cost-benefit analyses enhance public policy decision-making. In each of these cases, models further our understanding of how social, biological, and environmental processes impact health and disease in populations.

Mathematical modeling is a set of techniques, tools, and equations that can be tailored to particular disciplines. In epidemiology, mathematical models usually define interactions between individuals or populations and other individuals, populations, or environments. By defining the rules that describe these interactions and translating those rules into equations, a complex set of processes can be broken down into components and quantified. The model can then be used to explore relationships in the modeled population, to test the impact of changed rules on the system and its components, and to examine the outcomes of various events that might have an effect on a population.

Despite these many potential uses, mathematical models are, at present, used infrequently by epidemiologists. However, modeling has already made significant contributions to the health sciences (including both clinical medicine and public health) and to related disciplines such as biology, mathematics, statistics, and bioinformatics [1]. Some of these areas of research are summarized in Fig. 1. An increased familiarity with the many ways that mathematical models can be used in epidemiological research will allow models to be used more extensively and correctly, to be accessible to a broader range of epidemiologists, and to receive more critical examination. This paper describes the great variety of uses for mathematical models within the field of epidemiology and provides an overview of the methods of modeling.

Fig. 1. Examples of uses of models in epidemiology

The epidemiological research process can be considered to have four key steps (Fig. 2): (1) identifying study questions, (2) designing studies and collecting data, (3) analyzing data, and (4) applying research findings to public health. The mathematical modeling process follows four corresponding steps: (1) selecting key components for the model, (2) identifying and validating the inputs that will go into the model, (3) running the model, and (4) interpreting outputs and explaining the applications of the model results. In the following four sections, we describe the applications of models to epidemiology and introduce some of the principles and techniques of modeling. Susceptible-infectious-removed (SIR) models, commonly used in infectious disease epidemiology to describe infection transmission dynamics, are used as a primary example, and additional model-based studies from many epidemiological disciplines are provided as supplemental illustrations.

Fig. 2. Framework for the application of mathematical models throughout the epidemiological research process

Identifying study questions

The first step in any research project is to identify the questions that will be explored. For new studies, this may involve conducting a community needs assessment. For ongoing projects, this may take the form of a program evaluation that weighs several possible next steps. For all studies, this step typically involves consulting the existing literature to identify what topics have previously been explored and to catalogue the gaps that remain to be filled. Some researchers find it helpful to create a simple sketch of the populations of interest, the exposures that will be examined, the relationships between these populations and/or exposures, and possible causal pathways for disease processes. This type of visual expression of what is and is not understood about a complex system can be a first step toward building a mathematical model.

For example, infection transmission dynamics can be represented using an SIR model such as the one shown in Fig. 3. SIR models are among the most commonly used models in epidemiology, and serve as a good introduction to the modeling process. In this simple model, every individual in a population is assigned to one of the three compartments: S (susceptible) for individuals at risk of infection, I (infected/infectious) for individuals who are currently infected, or R (recovered/removed) for individuals who have recovered from the infection and have immunity.

Fig. 3. Sample SIR model

Realism can be added to the model by making it more complex (Fig. 4). If an SIR model describes changes that would be expected to occur in a population over decades or longer, population dynamics can be built into the model so that susceptible individuals are “born” into the population and older adults “die” and are removed from it. (One of the advantages of modeling is the ability to “observe” several generations’ worth of data in mere minutes.) The death rate might be higher for individuals in box I, and that increased risk could be represented by an extra arrow out of the box for death due to infection. Each arrow represents a flow of individuals from one compartment to another, into the population due to birth, or out of the population due to death. Additional arrows could represent public health interventions, such as a new vaccine that allows individuals to move directly from the S box to the R box.

Fig. 4. More complex SIR model
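The additional arrows described above can be sketched in code. The following Python example is only an illustration under stated assumptions: the function name, the parameter names (birth, death, extra_death, lam, rho, vacc), and all numeric values are hypothetical and are not taken from Fig. 4 or from any study. It adds births into S, natural deaths out of every box, an extra death rate out of I, and a vaccination flow from S directly to R, and it advances the model with simple forward-Euler time steps.

```python
def simulate_extended_sir(s, i, r, params, years, dt=0.01):
    """Forward-Euler sketch of an SIR model with births, deaths, and vaccination.

    All rates are per year and purely illustrative (hypothetical values).
    """
    for _ in range(int(round(years / dt))):
        n = s + i + r
        births = params["birth"] * n        # arrow into S
        infection = params["lam"] * s       # arrow S -> I
        recovery = params["rho"] * i        # arrow I -> R
        vaccination = params["vacc"] * s    # intervention arrow S -> R
        # Natural death leaves every box; extra_death is the added arrow out of I.
        ds = births - infection - vaccination - params["death"] * s
        di = infection - recovery - (params["death"] + params["extra_death"]) * i
        dr = recovery + vaccination - params["death"] * r
        s, i, r = s + ds * dt, i + di * dt, r + dr * dt
    return s, i, r


# Hypothetical baseline: births balance natural deaths, no vaccination.
base = {"birth": 0.02, "death": 0.02, "extra_death": 0.0,
        "lam": 0.1, "rho": 0.5, "vacc": 0.0}
no_vaccine = simulate_extended_sir(990.0, 10.0, 0.0, base, years=50)
with_vaccine = simulate_extended_sir(990.0, 10.0, 0.0, dict(base, vacc=0.3), years=50)
```

Comparing no_vaccine with with_vaccine shows how adding a single intervention arrow changes the number of individuals in each box after 50 simulated years.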

Additional compartments could be added to represent age groups, sex groups, income groups, or other exposure categories. For example, an SIR model with two age groups, child and adult, would need six compartments: S, I, and R boxes for children and S, I, and R boxes for adults. More complex models might include separate boxes for males and females, genetic characteristics, behavioral risk factors, or other exposures, and might require hundreds of compartments. If the infection being studied is vector-borne, the model can incorporate information about the insect vector, the life cycle of the pathogen, and human behavior and biology [2, 3]. If the disease of interest is a chronic condition, a model can incorporate information about the progression of disease [4]. Multiple exposures and multiple outcomes can be included in a model. For example, a model of the impacts of various aspects of traffic—such as traffic volume, traffic speed, the presence of safe walking and bicycling zones, and amount of vehicle emissions—can investigate a variety of health outcomes, including respiratory and cardiovascular health, osteoporosis, mental health, and injuries [5].
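The growth in the number of compartments can be made concrete with a short sketch. The example below simply enumerates every combination of infection state and stratum; the function name and the particular strata (five age bands, two sexes, three income levels, two risk levels) are hypothetical choices made for illustration.

```python
from itertools import product


def build_compartments(states, **strata):
    """Return one compartment label per combination of state and stratum."""
    return list(product(states, *strata.values()))


states = ("S", "I", "R")

# Two age groups give the six compartments described in the text.
two_age_groups = build_compartments(states, age=("child", "adult"))

# Hypothetical stratification by age, sex, income, and behavioral risk.
stratified = build_compartments(
    states,
    age=("0-4", "5-14", "15-44", "45-64", "65+"),
    sex=("female", "male"),
    income=("low", "middle", "high"),
    risk=("low", "high"),
)
```

Because the number of compartments is the product of the number of categories in each dimension, even a modest stratification like this one already produces well over a hundred boxes, each of which needs input data.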

Selecting the components that will be included in a model requires seeking a balance between simplicity and complexity. A model that is too simple—one that ignores critical components or relationships—will not clarify what happens in the real world. A model that is too complex is likely to be inaccurate due to the impossibility or impracticality of acquiring sufficiently detailed input data for a large number of compartments and parameters.

Whether or not a researcher intends to build and test a formal mathematical model, this kind of sketch—and this sort of “systems thinking” [6]—can clarify the relationships that the researcher wants to explore. This, in turn, may contribute to the framing of study questions, the assessment of the components and interactions that may influence a system, and the selection of variables that will be measured.

Designing studies

The second step in the research process is to design studies that will collect valid data. Models can contribute to the planning of a field study by assisting in the selection of a sampling strategy—for example, models may identify certain population groups that should be preferentially recruited based on demographic characteristics or exposure history—and in the estimation of the required sample size and study duration. Models also help to clarify the assumptions that are built into a study’s design, such as the assumption that infection confers long-term immunity or the assumption that patients with chronic diseases are fully compliant with treatment regimens. There is a feedback loop between field studies and models: data from field studies are used to create models that represent the real world, and models provide information about how to best measure real-life variables.

To understand this symbiotic relationship, one must understand the process for identifying and validating the inputs that will go into the model. For SIR models, this step involves quantifying the proportion of the population located in each compartment in a model and assigning values to the rates of flow between compartments. The arrows on Figs. 3 and 4 show the flow of individuals (or the flow of a proportion of the population in each compartment) from S to I and from I to R over time. These rates are often represented using Greek letters, such as lambda (λ) for the infection rate and rho (ρ) for the recovery rate. When possible, the values for parameters like λ and ρ are estimated from real-life data, although sometimes they must be merely educated guesses. Clearly explaining the source of these values and how closely they estimate real-life values is an important part of justifying the validity and applicability of a model.

If constant values are assigned to all parameters of an SIR model, then it is said to be deterministic and the exact same output will result every time the model is run (assuming that the same model structure, equations, and parameter values are used). If a probability distribution is assigned to these parameters to better capture the uncertainty of the estimate, then the model is said to be stochastic, which means that the output will vary each time the model is run. Stochastic models are usually run thousands of times so that the probability distribution of outputs can be examined.
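The distinction can be illustrated with a minimal Python sketch. It assumes the simple three-box model with flows of λS and ρI, and it draws λ and ρ from uniform distributions whose ranges are hypothetical; any other distribution could be substituted. The function names, starting values, and time step are likewise assumptions made for the example.

```python
import random
import statistics


def final_recovered(lam, rho, s=990.0, i=10.0, r=0.0, days=30, dt=0.1):
    """Run the simple SIR model once and return the number in R at the end."""
    for _ in range(int(round(days / dt))):
        new_infections = lam * s * dt
        new_recoveries = rho * i * dt
        s, i, r = s - new_infections, i + new_infections - new_recoveries, r + new_recoveries
    return r


def stochastic_runs(n_runs, seed=None):
    """Repeat the model, drawing lam and rho from hypothetical uniform ranges."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        lam = rng.uniform(0.05, 0.15)  # assumed uncertainty in the infection rate
        rho = rng.uniform(0.10, 0.30)  # assumed uncertainty in the recovery rate
        results.append(final_recovered(lam, rho))
    return results


# Deterministic: fixed parameter values give the same output on every run.
deterministic_a = final_recovered(0.1, 0.2)
deterministic_b = final_recovered(0.1, 0.2)

# Stochastic: many runs give a distribution of outputs to examine.
outputs = stochastic_runs(1000, seed=1)
summary = (min(outputs), statistics.median(outputs), max(outputs))
```

Passing a fixed seed makes the set of stochastic runs reproducible for reporting purposes; leaving the seed unset lets the outputs vary from one execution to the next.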

A key step in testing a model is sensitivity analysis, which determines how much each model component contributes to the output of the whole model. This process is an important contribution of modeling to the design of valid field studies. Parameters to which the model is highly sensitive strongly influence the outcomes of the model and, presumably, real-life outcomes, so the corresponding variables must be measured very carefully. Other parameters may have almost no impact on the outcome of a model. These variables may not even need to be included in data collection, although it is important to err on the side of caution when using models to identify key variables, since the results of sensitivity testing depend on the model structure and assumptions, the definition of parameters, and existing data. Still, the identification of parameters that appear to be highly influential can be critical to the development of valid data collection procedures.
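One simple way to carry out such a test is a one-at-a-time (local) sensitivity analysis, in which each parameter is raised and lowered by a fixed percentage while the others are held at their baseline values. The sketch below is one possible approach, not a prescribed method; the model function, the choice of peak number infected as the output, the 10% perturbation, and the baseline values are all assumptions made for illustration.

```python
def peak_infected(params, s=990.0, i=10.0, r=0.0, days=60, dt=0.1):
    """Run the simple SIR model and return the largest value reached by I."""
    lam, rho = params["lam"], params["rho"]
    peak = i
    for _ in range(int(round(days / dt))):
        new_infections = lam * s * dt
        new_recoveries = rho * i * dt
        s, i, r = s - new_infections, i + new_infections - new_recoveries, r + new_recoveries
        peak = max(peak, i)
    return peak


def one_at_a_time_sensitivity(model, baseline, rel_change=0.10):
    """Estimate the percent change in output per percent change in each input."""
    base_output = model(baseline)
    sensitivities = {}
    for name, value in baseline.items():
        raised = dict(baseline, **{name: value * (1 + rel_change)})
        lowered = dict(baseline, **{name: value * (1 - rel_change)})
        relative_output_change = (model(raised) - model(lowered)) / base_output
        sensitivities[name] = relative_output_change / (2 * rel_change)
    return sensitivities


# Hypothetical baseline values for the infection and recovery rates.
sens = one_at_a_time_sensitivity(peak_infected, {"lam": 0.1, "rho": 0.2})
```

A value far from zero marks a parameter that deserves careful measurement in the field, while the sign indicates whether increasing the parameter raises or lowers the output. Because this approach varies one parameter at a time around a single baseline, it does not capture interactions between parameters.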

Once the framework for study design and sampling has been developed, models can be used to explore the balance between sample size and statistical power or to determine whether the proposed study can be completed within budget and time constraints. In some cases, models may show that a field study is unlikely to produce meaningful results. For example, it may not be practical to conduct a study if a large population with special attributes is required and known to be unavailable, or if the most important variables are difficult or impossible to measure accurately and reliably. In situations in which a model shows that a field study is impractical, mathematical models may be able to replace field studies by generating output based on information that is already available. Replacements for field studies may be essential when time constraints or ethical concerns prevent a trial from being conducted. For example, a model of foodborne disease outbreaks based on past outbreaks can be used to estimate the effects of changes in human and pathogen behavior on population health rather than waiting to see what the outcomes of a particular emergent threat are before updating policy recommendations [7].

Analyzing data

Once a field study has been implemented and data have been collected, the third stage of the research process is data analysis. Whether or not data were collected with a mathematical model in mind, a model can be created or modified for use in interpretation of results and for causal analysis. After assigning values to compartments and parameters, a model can be run—the equations solved, usually as a function of time—and the outcome variables can be displayed visually on a computer monitor, usually in the form of graphs.

A first step in analyzing data with mathematical models is to use model representations to simplify complex data sets into manageable relationships and pathways to explore. For example, in an SIR model, equations define the change in the number of people (or the proportion of the total population) in each compartment during a certain time period. In the model shown in Fig. 3, there is only one arrow leaving the S box. The equation for the change in this compartment over time is written as dS/dt = −λS, which says that, per unit of time, the S box loses individuals at a rate of λ times the number currently in S. The equations for the other compartments are dI/dt = +λS − ρI and dR/dt = +ρI. Individuals who leave box S (−λS) enter box I (+λS), and individuals who leave box I (−ρI) enter box R (+ρI). More complex models (such as the one in Fig. 4) may require more complex equations. For example, if the infection rate is found to be related to the proportion of infectious individuals in the population at a given time, which is I/(S+I+R), it would be more accurate to write the equation for the flow out of the S box as dS/dt = −λS(I/(S+I+R)).
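These equations can be solved numerically in a few lines of code. The following Python sketch uses forward-Euler time steps, which is only one of many possible solution methods; the function name, the step size, the starting counts, and the values chosen for λ (lam) and ρ (rho) are illustrative assumptions. A flag switches between the simple flow −λS and the form in which the flow depends on the infectious proportion, −λS(I/(S+I+R)).

```python
def run_sir(s, i, r, lam, rho, days, dt=0.01, frequency_dependent=False):
    """Integrate dS/dt, dI/dt, and dR/dt with forward-Euler steps.

    Returns a list of (time, S, I, R) tuples, one per time step.
    """
    history = [(0.0, s, i, r)]
    for k in range(1, int(round(days / dt)) + 1):
        n = s + i + r
        if frequency_dependent:
            infection = lam * s * (i / n)   # flow out of S is lam*S*I/(S+I+R)
        else:
            infection = lam * s             # flow out of S is lam*S
        recovery = rho * i                  # flow out of I is rho*I
        s = s + dt * (-infection)
        i = i + dt * (infection - recovery)
        r = r + dt * recovery
        history.append((k * dt, s, i, r))
    return history


# Hypothetical population of 1000 with 10 initially infectious individuals.
simple = run_sir(990.0, 10.0, 0.0, lam=0.1, rho=0.2, days=10)
mixing = run_sir(990.0, 10.0, 0.0, lam=0.5, rho=0.2, days=10,
                 frequency_dependent=True)
```

Each entry in the returned list can be plotted against time to produce the kind of graph described above. In both versions, every individual who leaves one box enters another, so the total S + I + R stays constant throughout the run.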

It is also possible to define how different types of individuals relate to one another using structured (or preferential) mixing equations that describe how certain individuals or populations interact with one another. For example, these equations might specify that a child is more likely to have contact with another child than with an adult or that individuals who engage in high-risk behavior (such as unprotected sexual intercourse) are more likely to engage in risky behavior with other high-risk individuals than with low-risk individuals. An even more complex model may require moving beyond compartments to individual-based simulation in which each simulated individual in the population has a personal history and a special set of rules that define how that individual interacts with other individuals and with the environment. Data can be used to refine the model components and pathways and to fill in population counts and rates of interaction. Conversely, models can be used with existing data to fill in gaps in knowledge.
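A structured mixing rule for two age groups might be sketched as follows. This is a simplified illustration rather than a standard implementation: the contact matrix, the group sizes, the per-contact transmission probability beta, and the recovery rate rho are all hypothetical values, and the infection rate for each group is assumed to be beta times the sum, over contacted groups, of daily contacts multiplied by the infectious proportion in the contacted group.

```python
GROUPS = ("child", "adult")

# Hypothetical daily contacts: CONTACTS[a][b] is the number of contacts that
# one person in group a has with members of group b.
CONTACTS = {
    "child": {"child": 8.0, "adult": 3.0},
    "adult": {"child": 2.0, "adult": 6.0},
}


def force_of_infection(state, beta):
    """Return the per-susceptible infection rate for each group."""
    rates = {}
    for group in GROUPS:
        total = 0.0
        for other in GROUPS:
            population = sum(state[other].values())
            total += CONTACTS[group][other] * state[other]["I"] / population
        rates[group] = beta * total
    return rates


def step(state, beta, rho, dt):
    """Advance every group by one forward-Euler time step."""
    rates = force_of_infection(state, beta)
    new_state = {}
    for group in GROUPS:
        s, i, r = state[group]["S"], state[group]["I"], state[group]["R"]
        infection = rates[group] * s * dt
        recovery = rho * i * dt
        new_state[group] = {"S": s - infection, "I": i + infection - recovery,
                            "R": r + recovery}
    return new_state


# Hypothetical start: 400 children (5 infectious) and 600 adults (none infectious).
initial = {"child": {"S": 395.0, "I": 5.0, "R": 0.0},
           "adult": {"S": 600.0, "I": 0.0, "R": 0.0}}
initial_rates = force_of_infection(initial, beta=0.05)

state = initial
for _ in range(100):  # 10 simulated days with dt = 0.1
    state = step(state, beta=0.05, rho=0.2, dt=0.1)
```

With infection seeded only among children, the initial infection rate is higher for children than for adults because children have more contacts with other children. The contact values were chosen so that the total number of child-adult contacts matches in both directions (400 × 3 = 600 × 2).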

Models also allow for fuller use of existing data and the concurrent analysis of data from a variety of sources. For example, models have been used to estimate the incidence of hepatitis A virus infections based on seroprevalence data from more than one hundred field studies from around the world [8] and to analyze HIV transmission dynamics in populations of injecting drug users by combining surveillance information with testing of needles used in exchange programs [9]. Other models have compiled information about pathology, immunology, and epidemiology into one model of the causes of influenza outbreaks [10]; combined bacteriological, pharmacological, and treatment information into an analysis of antibiotic resistance risks in hospitals [11]; and incorporated longitudinal data on household socioeconomic status and family violence into a model of mental health [12]. As models in epidemiology and other fields refer to and refine each other, data collected by epidemiologists become even more valuable for understanding population health and predicting changes in public health status.

Applying findings to public health

A typical final step in the epidemiologic research process is to identify the lessons learned from a study, which often takes the form of suggesting possible public health interventions based on the results of a field study, proposing appropriate policy measures to address public health concerns, or recommending future areas of research. Models can contribute to all three of these functions. Some of the first models in epidemiology were developed in 1760 by Bernoulli in order to promote the benefits of smallpox inoculation (variolation) [13], and applied models remain popular today as tools for persuasion and enhanced decision-making.

Models can be useful to both scientists and policymakers, and are helpful for demonstrating the value of public health programs to stakeholders. For example, a model of HPV vaccination in Finland that compared the effectiveness of vaccinating different populations at different ages determined that programs targeting females alone were almost as effective as programs for both sexes [14], while an evaluation of a possible HPV vaccination program for 12-year-old girls in the United States determined that the proposed program would have somewhat higher cost than existing childhood vaccination programs but would provide a similarly high benefit [15, 16]. Other studies have used surveillance data to predict the effects of policies or programs on the incidence and prevalence of other sexually transmitted infections in the general population [17–19]. Several established models address related questions: the SimSmoke simulation model uses assessments of the impact of past tobacco control policies to predict the impact of new policies on smoking prevalence in the future [20]; the BOLD model feeds data collected under rigorous standards at sites around the world into a model of the burden of chronic obstructive pulmonary diseases [21]; DISMOD II, a program available through the World Health Organization, checks the internal validity of burden of disease estimates [22]; and the Prevent model examines the impact of risk factors on chronic disease [23]. Studies exploring the best ways to allocate health resources have, for example, examined the relative impacts of resources used for preventing the onset of chronic diseases versus preventing the complications of existing cases [24]. Models have also been used to identify high-risk populations and to predict the impact of demographic shifts or behavioral changes on disease incidence and prevalence.

Models using data collected during a study or intervention can inspire further related interventions, trigger investigation of outcomes that are not understood, lead to changes in an intervention effort as it progresses, and provide reassurance that intervention programs are on track for success. For example, models were used to improve the Onchocerciasis Control Program in West Africa mid-stream. The effectiveness of the expensive, large-scale program to reduce the black fly vector that transmitted the parasite that causes onchocerciasis (also known as river blindness) was questioned when nearly a decade into the program there was little change in the prevalence of onchocerciasis in the treatment area. A model of the decreasing intensity of infections based on data collected during the intervention showed that continuing the program could lead to the elimination of onchocerciasis from the study area in just five additional years. The model proved to be correct [25], and the OnchoSim program is now being used for surveillance and planning in other regions of Africa [26].

Other studies have compared the projected health impacts of various types of interventions based on collected data. For example, a dynamic population model used to explore the relative outcomes of various types of smoking cessation interventions found that minimal counseling by a physician was the most cost-effective way to reduce tobacco use, but it was responsible for only a small portion of those who quit smoking; intensive counseling plus use of a pharmaceutical smoking cessation aid was more expensive, but was significantly more effective [27]. These applications of study results to public health show that the research process rarely ends at this stage, but instead flows naturally back to the identification of new questions to explore.

Conclusion

In the cycle of epidemiological research, mathematical models can provide many benefits, such as simplifying and presenting complex information, evaluating the significance of variables, performing additional analysis on data, and forecasting outcomes for a project or population (Fig. 2). The publication of epidemiological models can be of great benefit to the epidemiological community when researchers describe their frameworks, assumptions, analyses, and interpretations in clear and quantifiable terms.

At present, one of the main challenges to the expanded use of mathematical models in epidemiology is the limited pool of epidemiologists with the advanced mathematical training required to design and conduct high-level analysis. This impediment can be substantially alleviated by expanding collaborative research with experts in related disciplines, such as computer science, mathematics, bioinformatics, geography, and engineering. Epidemiology will benefit from more and broader collaborations, and interdisciplinary work will contribute to the development and application of both new tools and novel uses for existing analytic techniques. A related concern is the need for epidemiological modelers to clearly explain both the outcomes and the limitations of their work to the public, to politicians, and to public health professionals. As the number of epidemiologists comfortable with the use and interpretation of models grows, the number of researchers able to effectively communicate this information will also increase. This will enable researchers to make even fuller use of mathematical models during all stages of the epidemiologic research process.